This question already has answers here:
Remove rows with all or some NAs (missing values) in data.frame
(18 answers)
Closed 7 years ago.
I am working with a csv data set with around 1 million records. I need to perform two operations on the data set:
Prepare a dataset that excludes rows with any missing (blank) values.
Prepare another dataset that replaces empty values with "unknown".
I tried using Excel, but it takes too much time. Can someone show me how to do this in R?
To get complete cases, use this:
complete_df <- df[complete.cases(df),]
complete.cases returns a logical vector that tells you which rows of dataframe df are complete, and you can use that to subset the data.
To replace the NAs, you can use this:
new_df <- df
new_df[is.na(new_df)] <- 'Unknown'
Be aware that this can change the data types of the columns with missing data. For example, if a numeric column has its missing values replaced with 'Unknown', the whole column becomes a character column.
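The question mentions blank values, which read.csv only treats as NA in character columns if told to. The sketch below is one way to handle both tasks under that assumption; the inline CSV text and the column names name and score are made up for illustration, and in practice you would pass your file path instead of text =.

```r
# Toy CSV standing in for the real file (hypothetical columns).
csv_text <- "name,score
a,1.5
,2
c,
d,4"

# na.strings makes read.csv treat empty fields as NA in every column.
df <- read.csv(text = csv_text, na.strings = c("", "NA"),
               stringsAsFactors = FALSE)

# Dataset 1: keep only the rows without any missing value.
complete_df <- df[complete.cases(df), ]

# Dataset 2: fill NAs with "Unknown" in character columns only,
# so numeric columns keep their type (and keep their NAs).
new_df <- df
is_chr <- vapply(new_df, is.character, logical(1))
new_df[is_chr] <- lapply(new_df[is_chr], function(x) {
  x[is.na(x)] <- "Unknown"
  x
})
```

Restricting the replacement to character columns is a design choice: numeric columns stay numeric, at the cost of still containing NA.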
This question already has answers here:
Concatenate a vector of strings/character
(8 answers)
Closed last year.
I am trying to rename the rows in my dataset. I need to change them like this: the first row would be named "IL1", the second "IL2", ..., "ILn", where n is the number of rows in the dataset.
I know how to change them with, for example, rownames(df) <- c("IL1","IL2","IL3","IL4"). But typing the names one by one is only feasible for small datasets. I need to do this for a dataset with hundreds of rows.
Any ideas? Thank you.
You can use paste0:
rownames(df) <- paste0("IL", 1:nrow(df))
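As a small variation, seq_len(nrow(df)) can replace 1:nrow(df); the two give the same result unless the data frame has zero rows, where 1:0 would count backwards. The toy data frame below is made up for illustration.

```r
# Toy data frame with three rows (hypothetical data).
df <- data.frame(x = c(10, 20, 30))

# Build "IL1", "IL2", ... with one name per row.
rownames(df) <- paste0("IL", seq_len(nrow(df)))
```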
This question already has answers here:
Delete rows containing specific strings in R
(7 answers)
How to import multiple .csv files at once?
(15 answers)
Closed 1 year ago.
In my huge .csv data set, I have merged 100+ CSVs through cmd, headers included. Now I want to delete the following duplicated header from my master CSV in R:
Year|RecID|ParID|ConParID|Country|Division|RegCnty|RegDist|SubDist|RC|RD|RSD|Parish|Area|Part|Population|MalePop|FemalePop|NoOfInstit|InstitPop|ParType|Censusref|ImageRef|PageType|DocType|EnuDist|BuildType|BTCode|NoOfRooms|NoOfRoomsCode|Schedule|H|Absent|Absentcode|HSS|InstName|InstDesc|VessName|VessPos|PID|Sex|SexInf|Age|Cage|AgeInf|Cond|Mar|MarInf|Relat|Rela|RelInf|HeadInf|Occ|HollerOcc|Occode|HISCO|Industry|HollerInd|Employ|EmployCode|AtHome|Inactive|Disab|DisCode1|DisCode2|Bpstring|BpCmty|Std_Par|BpCnty|Cnti|Alt_Cnti|BpCtry|Ctry|Alt_Ctry|HollerB|Nationality|Lang|Langcode|YearsMar|MarYear|ChildTot|ChildAlive|ChildDead|ChildrenCode|HHD|H_Sex|H_Age|H_Rela|H_Mar|H_Occ|H_CFU|SameName|CFU|n_CFUs|tn_CFUs|CFUsize|Spouse|Father|Mother|f_Off|m_Off|m_Offm|f_Offm|Offsp|Kids|Relats|Inmates|Servts|Non_Rels|Visitors|Military
This header appears as many times as there were initial csv files, and not at a regular interval. How can I select all rows containing this header to include it in the following code:
myData <- myData[-c(...)]
Any help or alternative solutions appreciated. It's big data, so I cannot open it and remove the duplicates in Excel.
Instead of merging them in cmd, it is better to do so in R: merging all the data (a UNION) with the headers included as rows turns every column into strings, and you will have a lot of work converting the types back. See this answer for complete help on how to merge the files in R itself.
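A minimal sketch of merging in R, assuming the files are pipe-delimited (as the header in the question suggests) and all share the same columns. The folder name census_parts and the two tiny files written here are stand-ins so the example runs on its own.

```r
# Write two small pipe-delimited files to a temporary folder
# (hypothetical stand-ins for the 100+ real CSVs).
csv_dir <- file.path(tempdir(), "census_parts")
dir.create(csv_dir, showWarnings = FALSE)
writeLines(c("Year|RecID|Age", "1881|1|34", "1881|2|7"),
           file.path(csv_dir, "part1.csv"))
writeLines(c("Year|RecID|Age", "1891|3|52"),
           file.path(csv_dir, "part2.csv"))

# Read every CSV separately, so each header is consumed as a header
# and the column types are detected per file.
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
parts <- lapply(files, read.csv, sep = "|", stringsAsFactors = FALSE)

# Stack the pieces into one data frame.
myData <- do.call(rbind, parts)
```

Because each file is parsed on its own, no header row ends up in the data and numeric columns stay numeric.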
If you already have the merged data and don't want to repeat those steps, you can remove the header rows in R with the command below.
The Year column won't contain the value 'Year' except in header rows, so do this:
myData <- myData[myData$Year != 'Year',]
myData$Year != 'Year' is TRUE only for the meaningful (non-header) rows, so this replaces myData with the subset of those rows.
If the Year column can legitimately contain the value 'Year', apply the same logic to some other column.
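Once the header rows are gone, the columns are still character. One possible follow-up, sketched here on a made-up three-column extract of the data, is to let type.convert() guess the types again.

```r
# Toy merged data where a repeated header row forced everything to character
# (hypothetical three-column extract of the real file).
myData <- data.frame(
  Year = c("1881", "Year", "1891"),
  RecID = c("1", "RecID", "3"),
  Country = c("ENG", "Country", "SCT"),
  stringsAsFactors = FALSE
)

# Drop the rows that are really repeated headers.
myData <- myData[myData$Year != "Year", ]

# Re-detect column types; as.is = TRUE keeps text as character, not factor.
myData[] <- lapply(myData, type.convert, as.is = TRUE)

# Renumber the rows after the removal.
rownames(myData) <- NULL
```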
This question already has answers here:
How to find the mean of a column in R [duplicate]
(2 answers)
Closed 2 years ago.
I want to find the mean of each column in my dataset, which contains null / blank values.
I've attached screenshots of actual and sample data for reference.
I don't see the data?
Usually, if a column is numeric, you just call mean() on that column of the data frame. In RStudio you get a data frame as soon as you import an Excel file.
It is easier to work with the data frame if you name your columns.
names(dataframe_name) <- c("column1", "column2", "column3")
Then you can easily get the mean of a column. Pass na.rm = TRUE so that missing values are skipped:
mean(dataframe_name$column1, na.rm = TRUE)
library(tidyverse)
df %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
This calculates the mean for every numeric column in your dataframe.
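If you would rather not load the tidyverse, a base R sketch along the same lines is shown below. The data frame and its column names are made up for illustration; colMeans() is applied only to the numeric columns.

```r
# Toy data with missing values and one non-numeric column (hypothetical).
df <- data.frame(
  height = c(1.6, NA, 1.8),
  weight = c(60, 70, NA),
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Pick the numeric columns, then average each one while skipping NAs.
num_cols <- vapply(df, is.numeric, logical(1))
col_means <- colMeans(df[num_cols], na.rm = TRUE)
```

The result is a named numeric vector with one mean per numeric column.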
The best approach is to add a helper column using this formula:
=IfError(B1;0)
(Obviously you might need to use another cell reference)
This formula replaces all error values with zero; you can then use this column as the input for calculating your averages.
There are two simple ways that can solve your problem.
Set up a separate table next to the one you use and take the mean of the respective cells:
=IfError(Cell;0)
=AVERAGE(StartCell:EndCell)
Replace the null values with 0 by using the replace all function and take the mean value afterwards.
Note: Both approaches include the zeros when calculating the mean. If you want to avoid this, replace the null values with an empty string ("") instead of 0, since AVERAGE ignores empty and text cells. Hope that helps.
This question already has answers here:
How to do vlookup and fill down (like in Excel) in R?
(9 answers)
Closed 5 years ago.
I am trying to apply an operation to each row of a data frame. Below is the code.
for(i in 1:nrow(df)){
df$final[i] <- alignfile[alignfile$Response == df$value[i],]$Aligned
}
It basically does a vlookup against the "alignfile" data frame and creates a new column in data frame "df" holding the result of looking up each entry of its "value" column.
How do I replace this with one of the apply family of functions, so that I can get rid of the for loop, which is making it slow?
Looking for suggestions. Feel free to ask for clarification.
Thanks
You didn't provide a reproducible example, so take my answer with a grain of salt.
I think you don't need a for loop at all in this case (as in most cases with R), nor an apply function. I think this problem could be easily solved with an ifelse in the following way:
df$final <- ifelse(alignfile$Response==df$value, 1, 0)
This puts a 1 in the final column of df wherever the value in alignfile$Response equals the value in the same row of df$value, and a 0 otherwise. It assumes alignfile and df have the same number of rows (as it appears from the code you provided).
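If the two data frames do not line up row by row, and the goal is to bring back the Aligned value rather than a 0/1 flag, a different technique is a vectorized lookup with match(). This is a sketch on made-up data, not the approach described above; it assumes each Response value appears at most once in alignfile (match() returns the first hit otherwise).

```r
# Toy lookup table and data (hypothetical values).
alignfile <- data.frame(
  Response = c("yes", "no", "maybe"),
  Aligned = c(1, 0, 0.5),
  stringsAsFactors = FALSE
)
df <- data.frame(
  value = c("no", "yes", "unknown", "maybe"),
  stringsAsFactors = FALSE
)

# match() gives, for each df$value, the row of alignfile with that Response
# (NA when there is none); indexing Aligned by it does the whole lookup at once.
df$final <- alignfile$Aligned[match(df$value, alignfile$Response)]
```

Values with no match get NA in final, which is usually easier to spot than a silent error inside a loop.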
This question already has answers here:
Convert data.frame columns from factors to characters
(18 answers)
Closed 9 years ago.
I am having a heck of a time with factors injecting themselves in code where they are not preferred.
How do you remove all factors from a matrix? a vector? a data.frame?
Question update below
I thought the question would be general enough, but it is clearly not.
Factors creep in when using melt so I am looking for a way to remove the factors after I have executed the melt command. As you see from the example code below, the factor approach (not sure what to call that) enters for column 3. I presume it is because this column is text. I need to remove this factor because I am retrieving data from a matrix so a factor of 3 is meaningless (in this scenario).
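library(reshape2)  # provides melt(); the package is an assumption, the post does not name it
names(airquality) <- tolower(names(airquality))
data <- melt(airquality, id=c("month", "day"))
is.factor(data[,3])  # TRUE: melt() stores the variable names as a factor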
If you want to make a specific data frame factor-free, see: Convert data.frame columns from factors to characters
Assuming the data frame is named bob:
bob <- data.frame(lapply(bob, as.character), stringsAsFactors=FALSE)
Also, if you want to read in a specific data frame and have no factors in it from the start, you can write:
file <- read.table(pathtoFile, stringsAsFactors=FALSE)
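Note that lapply(bob, as.character) turns every column into character, numbers included. If only the factor columns should change, one possible variation is sketched below; the contents of bob are made up for illustration.

```r
# Toy data frame with one factor column (hypothetical contents).
bob <- data.frame(
  id = 1:3,
  variable = factor(c("ozone", "wind", "ozone")),
  value = c(41, 7.4, 36)
)

# Find the factor columns and convert only those to character.
is_fct <- vapply(bob, is.factor, logical(1))
bob[is_fct] <- lapply(bob[is_fct], as.character)
```

Integer and numeric columns are left untouched, so no later conversion back is needed.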