keep NA and blank rows in a data.frame - r

I have this dataset:
ID FARM WEIGHT
1 2 NA
2 2
3 3 57
4 4 58
5 7 NA
I want to select the blank and NA rows, so that my data.frame looks like this:
ID FARM WEIGHT
1 2 NA
2 2
5 7 NA
I tried this code:
newfile <- dataset[!(is.na(dataset$WEIGHT) | dataset$WEIGHT != ''),]
but it doesn't work; I get an empty dataset.

I tried your code; shouldn't you use dataset[is.na(dataset$WEIGHT) | dataset$WEIGHT=="",] instead? The following code works:
dataset <- data.frame(ID=1:5, FARM=c(2, 2, 3, 4, 7), WEIGHT=c(NA, "", "57", "58", NA) )
dataset[is.na(dataset$WEIGHT) | dataset$WEIGHT=="",]
# ID FARM WEIGHT
# 1 1 2 <NA>
# 2 2 2
# 5 5 7 <NA>

Just use:
dt[!complete.cases(dt), ]
or:
dt[rowSums(is.na(dt) | dt=="") > 0,]
Output:
ID FARM WEIGHT
1 1 2 NA
2 2 2 NA
5 5 7 NA
Note: if you want to read directly from a file, then you can also do:
dt<- read.csv("file.csv", na.strings=c("NA",""))
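Putting the two pieces together: a minimal, self-contained sketch (the CSV is a tempfile created just for the example) showing how na.strings converts both blanks and literal "NA" strings into real NAs on read, after which a single complete.cases() call finds the rows:

```r
# Write a small CSV matching the question's data (one blank, two "NA")
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,FARM,WEIGHT",
             "1,2,NA",
             "2,2,",
             "3,3,57",
             "4,4,58",
             "5,7,NA"), tmp)

# na.strings turns both "NA" and "" into real NA values on read
dt <- read.csv(tmp, na.strings = c("NA", ""))

# With the blanks now real NAs, complete.cases() alone does the job
dt[!complete.cases(dt), ]
#   ID FARM WEIGHT
# 1  1    2     NA
# 2  2    2     NA
# 5  5    7     NA
```

This sidesteps the `dt==""` comparison entirely, because the blank never survives the import.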

Related

How do we avoid for-loops when we want to conditionally add columns by reference? (condition to be evaluated separately in each row)

I have a data.table with many numbered columns. As a simpler example, I have this:
dat <- data.table(cbind(col1 = sample(1:5, 10, replace = TRUE),
                        col2 = sample(1:5, 10, replace = TRUE),
                        col3 = sample(1:5, 10, replace = TRUE),
                        col4 = sample(1:5, 10, replace = TRUE)),
                  oneMoreCol = 'a')
I want to create a new column as follows: in each row, sum the values of col1-col4, skipping any value that is NA or equal to 1.
My current code for this has two for-loops which is clearly not the way to do it:
for(i in 1:nrow(dat)){
  dat[i, 'sumCol' := {
    temp = 0
    for(j in 1:4){
      if(!is.na(dat[i, paste0('col', j), with = FALSE]) &
         dat[i, paste0('col', j), with = FALSE] != 1){
        temp = temp + dat[i, paste0('col', j), with = FALSE]
      }
    }
    temp
  }]
}
I would appreciate any advice on how to remove these for-loops. My code runs on a much bigger data.table and takes a long time.
A possible solution:
dat[, sumCol := rowSums(.SD * (.SD != 1), na.rm = TRUE), .SDcols = col1:col4]
which gives:
> dat
col1 col2 col3 col4 oneMoreCol sumCol
1: 4 5 5 3 a 17
2: 4 5 NA 5 a 14
3: 2 3 4 3 a 12
4: 1 2 3 4 a 9
5: 4 3 NA 5 a 12
6: 2 2 1 4 a 8
7: NA 2 NA 5 a 7
8: 4 2 2 4 a 12
9: 4 1 5 4 a 13
10: 2 1 5 1 a 7
Used data:
set.seed(20200618)
dat <- data.table(cbind(col1 = sample(c(NA, 1:5), 10, replace = TRUE),
                        col2 = sample(1:5, 10, replace = TRUE),
                        col3 = sample(c(1:5, NA), 10, replace = TRUE),
                        col4 = sample(1:5, 10, replace = TRUE)),
                  oneMoreCol = 'a')
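The trick in this answer is that `.SD != 1` produces a logical matrix which, multiplied elementwise with `.SD`, is coerced to 1/0 and so zeroes out every value equal to 1; `rowSums(..., na.rm = TRUE)` then drops the NAs. The same masking idea in plain base R, on a small matrix:

```r
m <- rbind(c(4, 5, NA, 3),
           c(1, 2,  3, 4))

# (m != 1) is TRUE/FALSE (NA where m is NA); multiplying coerces
# it to 1/0, so every value equal to 1 is replaced by 0
masked <- m * (m != 1)

# rowSums with na.rm = TRUE then skips the NAs
rowSums(masked, na.rm = TRUE)
# [1] 12  9
```

The first row sums 4 + 5 + 3 (the NA is dropped); the second sums 0 + 2 + 3 + 4 (the 1 was masked to 0).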

R: given a list of dataframes, how to add a new column to all rows in each dataframe

I have a case like the following: there is a list of dataframes
class(cc.purc$items) => "list"
length(cc.purc$items) => 970
class(cc.purc$items[[1]]) => "data.frame"
head(cc.purc$items, 2)
[[1]]
barcode quantity price amount grams litres
1 abc 1 1.00 1.00 NA NA
2 xyz 1 1.29 1.29 NA NA
[[2]]
barcode quantity price amount grams litres
1 abc2 1 5.5 5.5 NA NA
2 xyz2 -1 19.5 -19.5 NA NA
cc.purc has a field called "transaction_id", with one entry for each dataframe in the "items" list.
head(cc.purc$transaction_id, 2) => "62740" "62741"
I want to print all rows in all dataframes contained in list and add corresponding transaction_id as additional column to all rows.
Ex: want following
barcode quantity price amount grams litres tran_id
1 abc 1 1.00 1.00 NA NA 62740
2 xyz 1 1.29 1.29 NA NA 62740
3 abc2 1 5.5 5.5 NA NA 62741
4 xyz2 -1 19.5 -19.5 NA NA 62741
How to achieve this? Please help.
To get all rows from all dataframes in the "items" list, I can do the following:
do.call("rbind", cc.purc$items)
But I can't figure out how to add the corresponding transaction_id column to the relevant rows.
You can use Map() to loop over the data and the transaction IDs simultaneously.
Example data:
test <- list(data = list(data.frame(var1 = rnorm(4),
                                    var2 = runif(4)),
                         data.frame(var1 = rnorm(4),
                                    var2 = runif(4)),
                         data.frame(var1 = rnorm(4),
                                    var2 = runif(4))),
             tran_id = c(1:3))
# add new column to every dataframe
test$data <- Map(function(x, y){
x$tran_id <- y
return(x)
}, test$data, test$tran_id)
Result:
> test
$data
$data[[1]]
var1 var2 tran_id
1 0.99943735 0.57436983 1
2 -0.04483769 0.29832753 1
3 1.89678549 0.81138668 1
4 -0.58839397 0.07071112 1
$data[[2]]
var1 var2 tran_id
1 -0.018843434 0.84813495 2
2 -0.258920304 0.09818365 2
3 -0.009920782 0.07873543 2
4 0.833070609 0.47808518 2
$data[[3]]
var1 var2 tran_id
1 1.21224941 0.3587937 3
2 -0.65107256 0.9727788 3
3 1.54107062 0.8444594 3
4 -0.09976177 0.6034762 3
$tran_id
[1] 1 2 3
Bind the data together:
> do.call("rbind", test$data)
var1 var2 tran_id
1 0.999437345 0.57436983 1
2 -0.044837689 0.29832753 1
3 1.896785487 0.81138668 1
4 -0.588393971 0.07071112 1
5 -0.018843434 0.84813495 2
6 -0.258920304 0.09818365 2
7 -0.009920782 0.07873543 2
8 0.833070609 0.47808518 2
9 1.212249412 0.35879366 3
10 -0.651072562 0.97277883 3
11 1.541070621 0.84445938 3
12 -0.099761769 0.60347619 3
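The two steps can also be fused into a single expression: tag each data frame inside Map() and stack the result immediately. A self-contained sketch with made-up data (on the question's object the same pattern would run over cc.purc$items and cc.purc$transaction_id):

```r
items <- list(data.frame(barcode = c("abc",  "xyz"),  quantity = c(1,  1)),
              data.frame(barcode = c("abc2", "xyz2"), quantity = c(1, -1)))
ids <- c("62740", "62741")

# Tag each data frame with its id, then stack all of them at once
combined <- do.call(rbind,
                    Map(function(df, id) { df$tran_id <- id; df },
                        items, ids))
combined
```

Because Map() recycles its arguments in parallel, the i-th id always lands on the i-th data frame, regardless of how many rows each one has.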

R - Replace values in a specific even column based on values from a specific odd column, applied to the whole dataframe

My data frame:
data <- data.frame(A = c(1,5,6,8,7), qA = c(1,2,2,3,1), B = c(2,5,6,8,4), qB = c(2,2,1,3,1))
For the case of A and qA (= quality of A): I want the values whose quality value is 1 or 3 to be replaced by NA.
And the same for the case of B and qB.
The final data has to be like this:
desired_data <- data.frame(A = c("NA",5,6,"NA","NA"), qA = c(1,2,2,3,1), B = c(2,5,"NA","NA","NA"), qB = c(2,2,1,3,1))
My question is how to perform that?
I have a big dataframe with about 90 columns, so I need code which doesn't require the column names to work properly.
To help, I have this piece of code which selects the columns starting with the letter "q":
data[,grep("^[q]", colnames(data))]
You could just do this...
data[,seq(1,ncol(data),2)][(data[,seq(2,ncol(data),2)]==1)|
(data[,seq(2,ncol(data),2)]==3)] <- NA
data
A qA B qB
1 NA 1 2 2
2 5 2 5 2
3 6 2 NA 1
4 NA 3 NA 3
5 NA 1 NA 1
One solution is to separate the data into two tables and use vectorisation in base R:
data <- data.frame(A = c(1,5,6,8,7), qA = c(1,2,2,3,1), B = c(2,5,6,8,4), qB = c(2,2,1,3,1))
data
#> A qA B qB
#> 1 1 1 2 2
#> 2 5 2 5 2
#> 3 6 2 6 1
#> 4 8 3 8 3
#> 5 7 1 4 1
quality <- data[,grep("^[q]", colnames(data))]
data2 <- data[,setdiff(colnames(data), names(quality))]
data2[quality == 1 | quality == 3] <- NA
data2
#> A B
#> 1 NA 2
#> 2 5 5
#> 3 6 NA
#> 4 NA NA
#> 5 NA NA
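The two approaches can be combined so the value/quality pairing is driven by names rather than by column positions, which is safer on a 90-column frame where the even/odd alternation might not be perfect. A sketch, assuming (as in the example) every quality column is named q followed by its value column's name:

```r
data <- data.frame(A = c(1, 5, 6, 8, 7), qA = c(1, 2, 2, 3, 1),
                   B = c(2, 5, 6, 8, 4), qB = c(2, 2, 1, 3, 1))

# Locate the quality columns, then their value counterparts by name
q_cols   <- grep("^q", colnames(data))
val_cols <- match(sub("^q", "", colnames(data)[q_cols]), colnames(data))

# Blank out every value whose paired quality flag is 1 or 3
data[val_cols][data[q_cols] == 1 | data[q_cols] == 3] <- NA
data
#    A qA  B qB
# 1 NA  1  2  2
# 2  5  2  5  2
# 3  6  2 NA  1
# 4 NA  3 NA  3
# 5 NA  1 NA  1
```

The comparison on a data frame yields a logical matrix, and `[<-` on data[val_cols] accepts that matrix directly, so the replacement updates the original frame in one step.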

how to replace the NAs in a data frame with the average for the corresponding group

I have a data frame like this:
nums id
1233 1
3232 2
2334 3
3330 1
1445 3
3455 3
7632 2
NA 3
NA 1
And I can know the average "nums" of each "id" by using:
id_avg <- aggregate(nums ~ id, data = dat, FUN = mean)
What I would like to do is replace each NA with the average "nums" value of the corresponding id. For example, suppose the average "nums" of ids 1, 2, 3 are 1000, 2000, 3000 respectively; then the NA whose id == 3 will be replaced by 3000, and the last NA, whose id == 1, will be replaced by 1000.
I tried the following code to achieve this:
temp <- dat[is.na(dat$nums),]$id
dat[is.na(dat$nums),]$nums <- id_avg[id_avg[,"id"] ==temp,]$nums
However, the second part
id_avg[id_avg[,"id"] ==temp,]$nums
is always NA, which means I always pass NA to the NAs I want to replace.
I don't know where I went wrong. Or do you have a better method to do this?
Thank you
Or you can fix it by:
dat[is.na(dat$nums),]$nums <- id_avg$nums[temp]
nums id
1 1233.000 1
2 3232.000 2
3 2334.000 3
4 3330.000 1
5 1445.000 3
6 3455.000 3
7 7632.000 2
8 2411.333 3
9 2281.500 1
What you want is contained in the zoo package.
library(zoo)
na.aggregate.default(dat, by = dat$id)
nums id
1 1233.000 1
2 3232.000 2
3 2334.000 3
4 3330.000 1
5 1445.000 3
6 3455.000 3
7 7632.000 2
8 2411.333 3
9 2281.500 1
Here is a dplyr way:
df %>%
group_by(id) %>%
mutate(nums = replace(nums, is.na(nums), as.integer(mean(nums, na.rm = T))))
# Source: local data frame [9 x 2]
# Groups: id [3]
# nums id
# <int> <int>
# 1 1233 1
# 2 3232 2
# 3 2334 3
# 4 3330 1
# 5 1445 3
# 6 3455 3
# 7 7632 2
# 8 2411 3
# 9 2281 1
You essentially want to merge the id_avg back to the original data frame by the id column, so you can also use match to follow your original logic:
dat$nums[is.na(dat$nums)] <- id_avg$nums[match(dat$id[is.na(dat$nums)], id_avg$id)]
dat
# nums id
# 1: 1233.000 1
# 2: 3232.000 2
# 3: 2334.000 3
# 4: 3330.000 1
# 5: 1445.000 3
# 6: 3455.000 3
# 7: 7632.000 2
# 8: 2411.333 3
# 9: 2281.500 1
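A base-R alternative worth mentioning: ave() applies a function within each id group and returns a vector already aligned with the original rows, so no merge, match, or aggregate step is needed. A sketch on the question's data:

```r
dat <- data.frame(nums = c(1233, 3232, 2334, 3330, 1445, 3455, 7632, NA, NA),
                  id   = c(1, 2, 3, 1, 3, 3, 2, 3, 1))

# Within each id group, replace the NAs with that group's mean
dat$nums <- ave(dat$nums, dat$id,
                FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))
dat$nums
```

Row 8 (id 3) becomes mean(2334, 1445, 3455) = 2411.333 and row 9 (id 1) becomes mean(1233, 3330) = 2281.5, matching the other answers.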

R - Perform operations on a column and place the result in a different column, with the operation specified by the output column's name

I have a dataframe with 3 columns of data (L1, L2, L3) and empty columns labeled L1+L2, L2+L3, L3+L1, L1-L2, etc.: combinations of column operations. Is there a way to check the column name and perform the necessary operation to fill that new column with data?
I am thinking:
- use match to find the appropriate original columns, and a for loop to iterate over all of the output columns in this search?
So if the column I am attempting to fill is L1+L2, I would have something like:
apply(dataframe[, c(i, j)], 1, sum)
It seems strange that you would store your operations in your column names, but I suppose it is possible to achieve:
As always, sample data helps.
## Creating some sample data
mydf <- setNames(data.frame(matrix(1:9, ncol = 3)),
c("L1", "L2", "L3"))
## The operation you want to do...
morecols <- c(
combn(names(mydf), 2, FUN=function(x) paste(x, collapse = "+")),
combn(names(mydf), 2, FUN=function(x) paste(x, collapse = "-"))
)
## THE FINAL SAMPLE DATA
mydf[, morecols] <- NA
mydf
# L1 L2 L3 L1+L2 L1+L3 L2+L3 L1-L2 L1-L3 L2-L3
# 1 1 4 7 NA NA NA NA NA NA
# 2 2 5 8 NA NA NA NA NA NA
# 3 3 6 9 NA NA NA NA NA NA
One solution could be to use eval(parse(...)) within lapply to perform the calculations and store them to the relevant column.
mydf[morecols] <- lapply(names(mydf[morecols]), function(x) {
with(mydf, eval(parse(text = x)))
})
mydf
# L1 L2 L3 L1+L2 L1+L3 L2+L3 L1-L2 L1-L3 L2-L3
# 1 1 4 7 5 8 11 -3 -6 -3
# 2 2 5 8 7 10 13 -3 -6 -3
# 3 3 6 9 9 12 15 -3 -6 -3
dfrm <- data.frame( L1=1:3, L2=1:3, L3=3+1, `L1+L2`=NA,
`L2+L3`=NA, `L3+L1`=NA, `L1-L2`=NA,
check.names=FALSE)
dfrm
#------------
L1 L2 L3 L1+L2 L2+L3 L3+L1 L1-L2
1 1 1 4 NA NA NA NA
2 2 2 4 NA NA NA NA
3 3 3 4 NA NA NA NA
#-------------
dfrm[, 4:7] <- lapply(names(dfrm[, 4:7]),
function(nam) eval(parse(text=nam), envir=dfrm) )
dfrm
#-----------
L1 L2 L3 L1+L2 L2+L3 L3+L1 L1-L2
1 1 1 4 2 5 5 0
2 2 2 4 4 6 6 0
3 3 3 4 6 7 7 0
I chose to use eval(parse(text=...)) rather than with, since the use of with is specifically cautioned against in its help page. I'm not sure I can explain why the eval(..., target_dfrm) form should be any safer, though.
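If eval(parse()) feels too permissive, the operator can instead be parsed out of the column name explicitly and dispatched with get(). A sketch that assumes (as in both answers above) each target name is exactly two column names joined by a single + or -:

```r
mydf <- setNames(data.frame(matrix(1:9, ncol = 3)), c("L1", "L2", "L3"))
targets <- c("L1+L2", "L1-L3")

mydf[targets] <- lapply(targets, function(nam) {
  op    <- regmatches(nam, regexpr("[+-]", nam))  # the operator character
  parts <- strsplit(nam, "[+-]")[[1]]             # the two column names
  get(op)(mydf[[parts[1]]], mydf[[parts[2]]])     # e.g. `+`(L1, L2)
})
mydf
```

This evaluates nothing beyond the two named columns and the single extracted operator, so a malformed or malicious column name can at worst fail with an error rather than run arbitrary code.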
