Setting values to NA in a dataframe in R

Here is some reproducible code that shows the problem I am trying to solve in another dataset. Suppose I have a dataframe df with some "NULL" values in it. I would like to replace these with NAs, as I attempt to do below. But when I print the result, the missing value shows up as <NA>. The second dataframe, d, is what I would like to produce from df: its NA is a regular old NA without the angle brackets.
> df = data.frame(a=c(1,2,3,"NULL"),b=c(1,5,4,6))
> df[4,1] = NA
> print(df)
a b
1 1 1
2 2 5
3 3 4
4 <NA> 6
>
> d = data.frame(a=c(1,2,3,NA),b=c(1,5,4,6))
> print(d)
a b
1 1 1
2 2 5
3 3 4
4 NA 6
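A likely explanation, sketched in base R: because c(1,2,3,"NULL") mixes numbers with a string, column a is stored as character (or as a factor in R versions before 4.0), and R prints missing values in such columns as <NA>. The value is still an ordinary NA. Assuming every remaining entry is a numeric string, converting the column to numeric gives the plain NA display:

```r
df <- data.frame(a = c(1, 2, 3, "NULL"), b = c(1, 5, 4, 6))

# Mark the placeholder as missing first, so the conversion raises no warning
df$a[df$a == "NULL"] <- NA

# as.character() makes this work whether a is character or factor
df$a <- as.numeric(as.character(df$a))

print(df)      # column a is now numeric, so its NA prints without brackets
str(df$a)
```

If other non-numeric strings remain in the column, as.numeric() turns them into NA with a warning, so it is worth checking the column before converting.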

Related

How do I lag a data.frame?

I'd like to lag a whole dataframe in R.
In Python this is very easy with the pandas shift() function
(e.g. df.shift(1)).
However, I could not find a method in R as simple as pandas' shift().
How can I do this?
> x = data.frame(a=c(1,2,3),b=c(4,5,6))
> x
a b
1 1 4
2 2 5
3 3 6
What I want is,
> lag(x,1)
>
a b
1 NA NA
2 1 4
3 2 5
Any good idea?
Pretty simple in base R:
rbind(NA, head(x, -1))
a b
1 NA NA
2 1 4
3 2 5
head with -1 drops the final row and rbind with NA as the first argument adds a row of NAs.
You can also use row indexing [, like this
x[c(NA, 1:(nrow(x)-1)),]
a b
NA NA NA
1 1 4
2 2 5
This leaves an NA in the row name of the first row. To "fix" this, you can strip the data.frame class and then reassign it:
data.frame(unclass(x[c(NA, 1:(nrow(x)-1)),]))
a b
1 NA NA
2 1 4
3 2 5
You can use rep to produce longer lags, for example a lag of 2:
data.frame(unclass(x[c(rep(NA, 2), 1:(nrow(x)-2)),]))
a b
1 NA NA
2 NA NA
3 1 4
and even put this into a function
myLag <- function(dat, lag) data.frame(unclass(dat[c(rep(NA, lag), seq_len(nrow(dat)-lag)),]))
Give it a try
myLag(x, 2)
a b
1 NA NA
2 NA NA
3 1 4
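As a variation on the same indexing idea, here is a minimal sketch that lags each column separately, which keeps the original row names and column classes without the unclass step. The helper name lag_cols and its argument k are hypothetical, not taken from any of the answers:

```r
# Hypothetical helper: shift every column of dat down by k rows,
# padding the top with NA
lag_cols <- function(dat, k = 1) {
  n <- nrow(dat)
  stopifnot(k >= 0, k <= n)
  # NA indices yield NA values; the remaining indices select the first n - k rows
  idx <- c(rep(NA_integer_, k), seq_len(n - k))
  dat[] <- lapply(dat, function(col) col[idx])
  dat
}

x <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
lagged <- lag_cols(x, 1)
print(lagged)
```

Indexing a column with NA returns a missing value of that column's own type, so this should also work for Date or factor columns.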
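With dplyr, apply lag to every column:
library(dplyr)
# in dplyr >= 1.0 the preferred spelling is mutate(across(everything(), lag))
x %>% mutate_all(lag)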
a b
1 NA NA
2 1 4
3 2 5
Just for completeness this would be analogous to how zoo implements it (but for a data.frame since the zoo lag(...) method doesn't work on data.frame objects):
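lag.df <- function(x, lag) {
  n <- nrow(x)
  k <- min(abs(lag), n)        # number of rows to shift by
  pad <- rep(NA_integer_, k)   # NA indices produce all-NA rows
  # negative lag: NA rows first (shift down); positive lag: NA rows last (shift up)
  idx <- if (lag < 0) c(pad, seq_len(n - k)) else c(seq_len(n - k) + k, pad)
  out <- x[idx, , drop = FALSE]
  rownames(out) <- NULL
  out
}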
and use like this:
x <- data.frame(dt=c(as.Date('2019-01-01'), as.Date('2019-01-02'), as.Date('2019-01-03')), a=c(1,2,3),b=c(4,5,6))
lag.df(x, -1)
lag.df(x, 1)
or you can just use zoo:
library(zoo)
x <- data.frame(dt=c(as.Date('2019-01-01'), as.Date('2019-01-02'), as.Date('2019-01-03')), a=c(1,2,3),b=c(4,5,6))
x.zoo <- read.zoo(x)
lag(x.zoo, -1)
lag(x.zoo, 1)

Create a counter in a for loop in R

I'm an inexperienced R user and I need to build something fairly complicated.
My dataset looks like this:
(image: dataset)
a, b, c, d and e are different individuals.
I want to fill in column D as follows:
on the last row for each individual in column A, D = sum(C)/(B-1).
The expected results should look like this:
(image: results)
D4=sum(C2:C4)/(B4-1)=0.5
D6=sum(C5:C6)/(B6-1)=1, etc.
I attempted to deal with it with something like :
for(i in 2:NROW(dataset)){
dataset[i,4]<-ifelse(
(dataset[i,1]==data1[i-1,1]),sum(dataset[i,3])/(dataset[i,2]-1),NA
)
}
But it is obviously not sufficient, as it computes the D value for all the rows and not only the last for each individual, and it does not calculate the sum of C values for this individual.
And I really don't know how to figure it out. Do you guys have any advice ?
Many thanks.
If I understood your question correctly, then this is one approach to get to the desired result:
df <- data.frame(
A=c("a","a","a","b","b","c","c","c","d","e","e"),
B=c(3,3,3,2,2,3,3,3,1,2,2),
C=c(NA,1,0,NA,1,NA,0,1,NA,NA,0),
stringsAsFactors = FALSE)
for(i in 2:NROW(df)){
df[i,4]<-ifelse(
(df[i,1]!=df[i+1,1] | i == nrow(df)),sum(df[df$A == df[i,1],]$C, na.rm=TRUE)/(df[i,2]-1),NA
)
}
This code results in the following table:
A B C V4
1 a 3 NA NA
2 a 3 1 NA
3 a 3 0 0.5
4 b 2 NA NA
5 b 2 1 1.0
6 c 3 NA NA
7 c 3 0 NA
8 c 3 1 0.5
9 d 1 NA NaN
10 e 2 NA NA
11 e 2 0 0.0
The ifelse first tests whether the individual in column A of the current row differs from the individual in the next row OR whether the current row is the last row of the data frame.
If so, the row is the last one for that individual, and it takes the sum of column C (ignoring the NAs) over all rows for that individual, divided by the value in column B minus one.
Otherwise it puts an NA in the fourth column.
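The same rule can also be written without a loop. The following is a minimal base R sketch, not part of the answer above; it assumes the rows of each individual are meant to be summed together and that the last occurrence of each value in A marks the row to fill. The names group_sum and is_last are illustrative only:

```r
df <- data.frame(
  A = c("a","a","a","b","b","c","c","c","d","e","e"),
  B = c(3,3,3,2,2,3,3,3,1,2,2),
  C = c(NA,1,0,NA,1,NA,0,1,NA,NA,0),
  stringsAsFactors = FALSE)

# Sum of C within each individual, repeated on every row of that individual
group_sum <- ave(df$C, df$A, FUN = function(v) sum(v, na.rm = TRUE))

# TRUE on the last row of each individual
is_last <- !duplicated(df$A, fromLast = TRUE)

# Fill D only on those last rows
df$D <- ifelse(is_last, group_sum / (df$B - 1), NA)
print(df)
```

Individual d has B = 1 and no non-missing C, so its value is 0/0, which R reports as NaN, just as in the loop version.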
Using dplyr (with dftest holding the same data as df above, before the loop adds V4), you can compute D for every row and then set it to NA everywhere except the last row of each group:
library(dplyr)
dftest %>%
  group_by(A, B) %>%
  mutate(D = sum(C, na.rm = TRUE) / (B - 1)) %>%
  mutate(D = if_else(row_number() == n(), D, NA_real_))
which gives:
Source: local data frame [11 x 4]
Groups: A, B [5]
A B C D
<chr> <dbl> <dbl> <dbl>
1 a 3 NA NA
2 a 3 1 NA
3 a 3 0 0.5
4 b 2 NA NA
5 b 2 1 1.0
6 c 3 NA NA
7 c 3 0 NA
8 c 3 1 0.5
9 d 1 NA NaN
10 e 2 NA NA
11 e 2 0 0.0

How to change characters into NA?

I have a census dataset in which missing values are indicated with a ?.
When checking for incomplete cases, R says there are none because it treats the ? as a valid character. Is there any way to change all the ? to NAs? I would like to run multiple imputation with the mice package afterwards to fill in the missing data.
For a data frame, this should work. You may need to fiddle with the quotation marks. I have not tested this.
df[df == "?"] <- NA
Creating data frame df
df <- data.frame(A=c("?",1,2),B=c(2,3,"?"))
df
# A B
# 1 ? 2
# 2 1 3
# 3 2 ?
I. Using replace() function
replace(df,df == "?",NA)
# A B
# 1 <NA> 2
# 2 1 3
# 3 2 <NA>
II. While importing a file with ?
data <- read.table("xyz.csv", sep=",", header=TRUE, na.strings=c("?", "NA"))
data
# A B
# 1 1 NA
# 2 2 3
# 3 3 4
# 4 NA NA
# 5 NA NA
# 6 4 5
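To see the effect of na.strings without needing a file on disk, here is a small self-contained sketch. The inline CSV text and the names csv_text and census are made up for illustration:

```r
# Made-up data standing in for the census file
csv_text <- "A,B
?,2
1,3
2,?"

# Treat "?" as missing while reading
census <- read.csv(text = csv_text, na.strings = "?")

print(census)
complete.cases(census)   # incomplete rows are now detected
colSums(is.na(census))   # number of missing values per column
```

Because the ? never reaches the data frame, the columns are read as numbers rather than text, which is usually what an imputation step expects.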

How to remove rows in R with 0 and get averages of remaining rows

I would like to remove all 0's in my data and get the mean across columns for remaining values.
dataframe <- ("file.csv")
data_list = lapply (dataframe, read.table, header=TRUE)
My dataframe looks like this:
A B C
1 0 2
2 1 0
3 3 5
4 7 6
I would like my dataframe to look like this
A B C
3 3 5
4 7 6
I tried
dataframe[apply(dataframe[c(1:3)],1,function(z) !any(z==0)),]
and got this error
Error in apply(dataframe[c(1:3)], 1, function(z) !any(z == 0)) :
dim(X) must have a positive length
Additionally, I would like to get the mean of the remaining rows. Again, I am new to scripting altogether and am lost on what to do. I will give more information as needed but this is my full script for now
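The error occurs because dataframe holds the string "file.csv" rather than the data (the file contents end up in data_list), so apply() is handed a character vector with no dimensions. Once the data are in a data frame df, you can remove all the rows that contain a zero with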
df[!rowSums(df == 0), ]
# A B C
# 3 3 3 5
# 4 4 7 6
For the row means of the remaining rows,
rowMeans(df[!rowSums(df == 0), ])
# 3 4
# 3.666667 5.666667
where df is
df <- read.table(text="A B C
1 0 2
2 1 0
3 3 5
4 7 6", header=TRUE)
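The question asks for the mean "across columns", which could also mean one mean per column. A minimal sketch of two possible readings, using the same df; which one is intended is an assumption, and the variable names are illustrative:

```r
df <- read.table(text = "A B C
1 0 2
2 1 0
3 3 5
4 7 6", header = TRUE)

# Reading 1: drop every row that contains a zero, then average each column
kept <- df[rowSums(df == 0) == 0, ]
col_means_rows_dropped <- colMeans(kept)

# Reading 2: keep all rows, but ignore the zeros themselves when averaging
zeros_as_na <- df
zeros_as_na[zeros_as_na == 0] <- NA
col_means_zeros_ignored <- colMeans(zeros_as_na, na.rm = TRUE)

print(col_means_rows_dropped)
print(col_means_zeros_ignored)
```

The first reading discards the non-zero values that share a row with a zero; the second keeps them, so the two give different means.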

Calculating the occurrences of numbers in the subsets of a data.frame

I have a data frame in R which is similar to the following. My real 'df' data frame is much bigger than this one, but I have simplified things as much as possible to avoid confusion.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column 'id').
So, for column 'a' and for id number '1', the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column 'a' (considering only the records that have '1' in column 'id'), the number '1' occurred 3 times and the number '3' occurred 7 times.
Here is another example, for column 'a' and id number '2':
as.numeric(table(df[11:20,2]))
##After running the code the results are:
[1] 4 3 3
To explain again: in column 'a' (considering only the observations that have '2' in column 'id'), the number '1' occurred 4 times, the number '2' occurred 3 times and the number '3' occurred 3 times.
So this is what I would like to do: calculate the occurrences of numbers for each custom-defined subset (and then collect these values into a data frame). I know it is not a difficult task, but the PROBLEM is that I will have to change the input 'df' data frame on a regular basis, so both the number of rows and the number of columns might change over time.
What I have done so far is to separate the 'df' data frame by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c, etc. But I'm really stuck now and I don't know how to move forward.
Is there a proper, "automatic" way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
variable
id a b c d e
1 3 8 2 2 4
2 4 6 3 2 4
3 4 2 1 5 1
, , value = 2
variable
id a b c d e
1 0 1 4 3 3
2 3 3 3 6 2
3 1 4 5 3 4
, , value = 3
variable
id a b c d e
1 7 1 4 5 3
2 3 1 4 2 4
3 5 4 4 2 5
So to get the number of '3's in column 'a' and group '1'
you could just do
> dftab['1','a','3']
[1] 7
A combination of tapply and apply can create the data you want:
tapply(df$id,df$id,function(x) apply(df[id==x,-1],2,table))
However, when a group does not contain every value (group 1 has no 2s in column a), the result for that id is a list of tables rather than a neat matrix.
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
library(plyr)
# Tabulate every column except "id" and return the tables as a named list
ColTables <- function(df) {
  counts <- list()
  for(a in names(df)[names(df) != "id"]) {
    counts[[a]] <- table(df[a])
  }
  return(counts)
}
results <- dlply(df, "id", ColTables)
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
A way to do it is using the aggregate function, but you have to add a column to your dataframe
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))
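If a uniform shape is wanted even when a group is missing some values, one option is to tabulate each column against id with a shared set of factor levels. This is a minimal base R sketch, not one of the answers above; the helper name count_by_id and its id_col argument are hypothetical, and only columns a and b of the example data are reproduced to keep it short:

```r
id <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a  <- c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b  <- c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
df <- data.frame(id, a, b)

# Hypothetical helper: one id-by-value count table per non-id column
count_by_id <- function(dat, id_col = "id") {
  value_cols <- setdiff(names(dat), id_col)
  # Every value seen in any column, so all tables share the same columns
  all_values <- sort(unique(unlist(dat[value_cols])))
  lapply(dat[value_cols], function(col) {
    table(id = dat[[id_col]], value = factor(col, levels = all_values))
  })
}

counts <- count_by_id(df)
counts$a            # rows are id groups, columns are values
counts$a["1", "3"]  # count of 3s in column a for id 1
```

Because the helper works from the column names, it should keep working when rows or columns are added to df, and values that never occur in a group show up as explicit zeros.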
