Filling specific duplicated values within the rows of a dataframe with NAs - r

For each row of my dataframe, I am currently trying to select all the duplicated values equal to 4 in order to set them "equal" to NA.
My dataframe is like this:
dat <- read.table(text = "
1 1 1 2 2 4 4 4
1 2 1 1 4 4 4 4",
header=FALSE)
What I need to obtain is:
1 1 1 2 2 4 NA NA
1 2 1 1 4 NA NA NA
I have found information on how to eliminate duplicated rows or columns, but I really do not know how to proceed here.. many thanks for any help

Sometimes you will want to avoid apply because it destroys the multi-class feature of dataframe objects. This is a by approach:
> do.call(rbind, by(dat, rownames(dat),
function(line) {line[ duplicated(unlist(line)) & line==4 ] <- NA; line} ) )
V1 V2 V3 V4 V5 V6 V7 V8
1 1 1 1 2 2 4 NA NA
2 1 2 1 1 4 NA NA NA

which and apply are helpful here.
> dat <- t(apply(dat, 1, function(X) {X[which(X==4)][-1] <- NA ; X}))
> dat
[1,] 1 1 1 2 2 4 NA NA
[2,] 1 2 1 1 4 NA NA NA
But there's probably a way around having to use the transpose (t) function here, can anyone help me out?

duplicated can be used in this way with an apply:
dat <- t(apply(dat, 1, function(x) {x[duplicated(x) & x == 4] <- NA ; x}))

Related

Replace values within a range in a data frame in R

I have ranked rows in a data frame based on values in each column.Ranking 1-10. not every column in picture
I have code that replaces values to NA or 1. But I can't figure out how to replace range of numbers, e.g. 3-6 with 1 and then replace the rest (1-2 and 7-10) with NA.
lag.rank <- as.matrix(lag.rank)
lag.rank[lag.rank > n] <- NA
lag.rank[lag.rank <= n] <- 1
At the moment it only replaces numbers above or under n. Any suggestions? I figure it should be fairly simple?
Is this what your are trying to accomplish?
> x <- sample(1:10,20, TRUE)
> x
[1] 1 2 8 2 6 4 9 1 4 8 6 1 2 5 8 6 9 4 7 6
> x <- ifelse(x %in% c(3:6), 1, NA)
> x
[1] NA NA NA NA 1 1 NA NA 1 NA 1 NA NA 1 NA 1 NA 1 NA 1
If your data aren't integers but numeric you can use between from the dplyr package:
x <- ifelse(between(x,3,6), 1, NA)

How do I lag a data.frame?

I'd like to lag whole dataframe in R.
In python, it's very easy to do this, using shift() function
(ex: df.shift(1))
However, I could not find any as an easy and simple method as in pandas shift() in R.
How can I do this?
> x = data.frame(a=c(1,2,3),b=c(4,5,6))
> x
a b
1 1 4
2 2 5
3 3 6
What I want is,
> lag(x,1)
>
a b
1 NA NA
2 1 4
3 2 5
Any good idea?
Pretty simple in base R:
rbind(NA, head(x, -1))
a b
1 NA NA
2 1 4
3 2 5
head with -1 drops the final row and rbind with NA as the first argument adds a row of NAs.
You can also use row indexing [, like this
x[c(NA, 1:(nrow(x)-1)),]
a b
NA NA NA
1 1 4
2 2 5
This leaves an NA in the row name of the first variable, to "fix" this, you can strip the data.frame class and then reassign it:
data.frame(unclass(x[c(NA, 1:(nrow(x)-1)),]))
a b
1 NA NA
2 1 4
3 2 5
Here, you can use rep to produce the desired lags
data.frame(unclass(x[c(rep(NA, 2), 1:(nrow(x)-2)),]))
a b
1 NA NA
2 NA NA
3 1 4
and even put this into a function
myLag <- function(dat, lag) data.frame(unclass(dat[c(rep(NA, lag), 1:(nrow(dat)-lag)),]))
Give it a try
myLag(x, 2)
a b
1 NA NA
2 NA NA
3 1 4
library(dplyr)
x %>% mutate_all(lag)
a b
1 NA NA
2 1 4
3 2 5
Just for completeness this would be analogous to how zoo implements it (but for a data.frame since the zoo lag(...) method doesn't work on data.frame objects):
lag.df <- function(x, lag) {
if (lag < 0)
rbind(NA, head(x, lag))
else
rbind(tail(x, -lag), NA)
}
and use like this:
x <- data.frame(dt=c(as.Date('2019-01-01'), as.Date('2019-01-02'), as.Date('2019-01-03')), a=c(1,2,3),b=c(4,5,6))
lag.df(x, -1)
lag.df(x, 1)
or you can just use zoo:
library(zoo)
x <- data.frame(dt=c(as.Date('2019-01-01'), as.Date('2019-01-02'), as.Date('2019-01-03')), a=c(1,2,3),b=c(4,5,6))
x.zoo <- read.zoo(x)
lag(x.zoo, -1)
lag(x.zoo, 1)

Replacing NAs between two rows with identical values in a specific column

I have a dataframe with multiple columns and I want to replace NAs in one column if they are between two rows with an identical number. Here is my data:
v1 v2
1 2
NA 3
NA 2
1 1
NA 7
NA 2
3 1
I basically want to start from the beginning of the data frame and replcae NAs in column v1 with previous Non NA if the next Non NA matches the previous one. That been said, I want the result to be like this:
v1 v2
1 2
1 3
1 2
1 1
NA 7
NA 2
3 1
As you may see, rows 2 and 3 are replaced with number "1" because row 1 and 4 had an identical number but rows 5,6 stays the same because the non na values in rows 4 and 7 are not identical. I have been twicking a lot but so far no luck. Thanks
Here is an idea using zoo package. We basically fill NAs in both directions and set NA the values that are not equal between those directions.
library(zoo)
ind1 <- na.locf(df$v1, fromLast = TRUE)
df$v1 <- na.locf(df$v1)
df$v1[df$v1 != ind1] <- NA
which gives,
v1 v2
1 1 2
2 1 3
3 1 2
4 1 1
5 NA 7
6 NA 2
7 3 1
Here is a similar approach in tidyverse using fill
library(tidyverse)
df1 %>%
mutate(vNew = v1) %>%
fill(vNew, .direction = 'up') %>%
fill(v1) %>%
mutate(v1 = replace(v1, v1 != vNew, NA)) %>%
select(-vNew)
# v1 v2
#1 1 2
#2 1 3
#3 1 2
#4 1 1
#5 NA 7
#6 NA 2
#7 3 1
Here is a base R solution, the logic is almost the same as Sotos's one:
replace_na <- function(x){
f <- function(x) ave(x, cumsum(!is.na(x)), FUN = function(x) x[1])
y <- f(x)
yp <- rev(f(rev(x)))
ifelse(!is.na(y) & y == yp, y, x)
}
df$v1 <- replace_na(df$v1)
test:
> replace_na(c(1, NA, NA, 1, NA, NA, 3))
[1] 1 1 1 1 NA NA 3
I could use na.locf function to do so. Basically, I use the normal na.locf function package zoo to replace each NA with the latest previous non NA and store the data in a column. by using the same function but fixing fromlast=TRUE NAs are replaces with the first next nonNA and store them in another column. I checked these two columns and if the results in each row for these two columns are not matching I replace them with NA.

Count occurrences of value in a set of variables in R (per row)

Let's say I have a data frame with 10 numeric variables V1-V10 (columns) and multiple rows (cases).
What I would like R to do is: For each case, give me the number of occurrences of a certain value in a set of variables.
For example the number of occurrences of the numeric value 99 in that single row for V2, V3, V6, which obviously has a minimum of 0 (none of the three have the value 99) and a maximum of 3 (all of the three have the value 99).
I am really looking for an equivalent to the SPSS function COUNT: "COUNT creates a numeric variable that, for each case, counts the occurrences of the same value (or list of values) across a list of variables."
I thought about table() and library plyr's count(), but I cannot really figure it out. Vectorized computation preferred. Thanks a lot!
If you need to count any particular word/letter in the row.
#Let df be a data frame with four variables (V1-V4)
df <- data.frame(V1=c(1,1,2,1,L),V2=c(1,L,2,2,L),
V3=c(1,2,2,1,L), V4=c(L, L, 1,2, L))
For counting number of L in each row just use
#This is how to compute a new variable counting occurences of "L" in V1-V4.
df$count.L <- apply(df, 1, function(x) length(which(x=="L")))
The result will appear like this
> df
V1 V2 V3 V4 count.L
1 1 1 1 L 1
2 1 L 2 L 2
3 2 2 2 1 0
4 1 2 1 2 0
I think that there ought to be a simpler way to do this, but the best way that I can think of to get a table of counts is to loop (implicitly using sapply) over the unique values in the dataframe.
#Some example data
df <- data.frame(a=c(1,1,2,2,3,9),b=c(1,2,3,2,3,1))
df
# a b
#1 1 1
#2 1 2
#3 2 3
#4 2 2
#5 3 3
#6 9 1
levels=unique(do.call(c,df)) #all unique values in df
out <- sapply(levels,function(x)rowSums(df==x)) #count occurrences of x in each row
colnames(out) <- levels
out
# 1 2 3 9
#[1,] 2 0 0 0
#[2,] 1 1 0 0
#[3,] 0 1 1 0
#[4,] 0 2 0 0
#[5,] 0 0 2 0
#[6,] 1 0 0 1
Try
apply(df,MARGIN=1,table)
Where df is your data.frame. This will return a list of the same length of the amount of rows in your data.frame. Each item of the list corresponds to a row of the data.frame (in the same order), and it is a table where the content is the number of occurrences and the names are the corresponding values.
For instance:
df=data.frame(V1=c(10,20,10,20),V2=c(20,30,20,30),V3=c(20,10,20,10))
#create a data.frame containing some data
df #show the data.frame
V1 V2 V3
1 10 20 20
2 20 30 10
3 10 20 20
4 20 30 10
apply(df,MARGIN=1,table) #apply the function table on each row (MARGIN=1)
[[1]]
10 20
1 2
[[2]]
10 20 30
1 1 1
[[3]]
10 20
1 2
[[4]]
10 20 30
1 1 1
#desired result
Here is another straightforward solution that comes closest to what the COUNT command in SPSS does — creating a new variable that, for each case (i.e., row) counts the occurrences of a given value or list of values across a list of variables.
#Let df be a data frame with four variables (V1-V4)
df <- data.frame(V1=c(1,1,2,1,NA),V2=c(1,NA,2,2,NA),
V3=c(1,2,2,1,NA), V4=c(NA, NA, 1,2, NA))
#This is how to compute a new variable counting occurences of value "1" in V1-V4.
df$count.1 <- apply(df, 1, function(x) length(which(x==1)))
The updated data frame contains the new variable count.1 exactly as the SPSS COUNT command would do.
> df
V1 V2 V3 V4 count.1
1 1 1 1 NA 3
2 1 NA 2 NA 1
3 2 2 2 1 1
4 1 2 1 2 2
5 NA NA NA NA 0
You can do the same to count how many time the value "2" occurs per row in V1-V4. Note that you need to select the columns (variables) in df to which the function is applied.
df$count.2 <- apply(df[1:4], 1, function(x) length(which(x==2)))
You can also apply a similar logic to count the number of missing values in V1-V4.
df$count.na <- apply(df[1:4], 1, function(x) sum(is.na(x)))
The final result should be exactly what you wanted:
> df
V1 V2 V3 V4 count.1 count.2 count.na
1 1 1 1 NA 3 0 1
2 1 NA 2 NA 1 1 2
3 2 2 2 1 1 3 0
4 1 2 1 2 2 2 0
5 NA NA NA NA 0 0 4
This solution can easily be generalized to a range of values.
Suppose we want to count how many times a value of 1 or 2 occurs in V1-V4 per row:
df$count.1or2 <- apply(df[1:4], 1, function(x) sum(x %in% c(1,2)))
A solution with functions from the dplyr package would be the following:
Using the example data set from LechAttacks answer:
df <- data.frame(V1=c(1,1,2,1,NA),V2=c(1,NA,2,2,NA),
V3=c(1,2,2,1,NA), V4=c(NA, NA, 1,2, NA))
Count the appearances of "1" and "2" each and both combined:
df %>%
rowwise() %>%
mutate(count_1 = sum(c_across(V1:V4) == 1, na.rm = TRUE),
count_2 = sum(c_across(V1:V4) == 2, na.rm = TRUE),
count_12 = sum(c_across(V1:V4) %in% 1:2, na.rm = TRUE)) %>%
ungroup()
which gives the table:
V1 V2 V3 V4 count_1 count_2 count_12
1 1 1 1 NA 3 0 3
2 1 NA 2 NA 1 1 2
3 2 2 2 1 1 3 4
4 1 2 1 2 2 2 4
5 NA NA NA NA 0 0 0
In my effort to find something similar to Count from SPSS in R is as follows:
`df <- data.frame(a=c(1,1,NA,2,3,9),b=c(1,2,3,2,NA,1))` #Dummy data with NAs
`df %>%
dplyr::mutate(count = rowSums( #this allows calculate sum across rows
dplyr::select(., #Slicing on .
dplyr::one_of( #within select use one_of by clarifying which columns your want
c('a','b'))), na.rm = T)) #once the columns are specified, that's all you need, na.rm is cherry on top
That's how the output looks like
>df
a b count
1 1 1 2
2 1 2 3
3 NA 3 3
4 2 2 4
5 3 NA 3
6 9 1 10
Hope it helps :-)

R- Perform operations on column and place result in a different column, with the operation specified by the output column's name

I have a dataframe with 3 columns- L1, L2, L3- of data and empty columns labeled L1+L2, L2+L3, L3+L1, L1-L2, etc. combinations of column operations. Is there a way to check the column name and perform the necessary operation to fill that new column with data?
I am thinking:
-use match to find the appropriate original columns and using a for loop to iterate over all of the columns in this search?
so if the column I am attempting to fill is L1+L2 I would have something like:
apply(dataframe[,c(i, j), 1, sum)
It seems strange that you would store your operations in your column names, but I suppose it is possible to achieve:
As always, sample data helps.
## Creating some sample data
mydf <- setNames(data.frame(matrix(1:9, ncol = 3)),
c("L1", "L2", "L3"))
## The operation you want to do...
morecols <- c(
combn(names(mydf), 2, FUN=function(x) paste(x, collapse = "+")),
combn(names(mydf), 2, FUN=function(x) paste(x, collapse = "-"))
)
## THE FINAL SAMPLE DATA
mydf[, morecols] <- NA
mydf
# L1 L2 L3 L1+L2 L1+L3 L2+L3 L1-L2 L1-L3 L2-L3
# 1 1 4 7 NA NA NA NA NA NA
# 2 2 5 8 NA NA NA NA NA NA
# 3 3 6 9 NA NA NA NA NA NA
One solution could be to use eval(parse(...)) within lapply to perform the calculations and store them to the relevant column.
mydf[morecols] <- lapply(names(mydf[morecols]), function(x) {
with(mydf, eval(parse(text = x)))
})
mydf
# L1 L2 L3 L1+L2 L1+L3 L2+L3 L1-L2 L1-L3 L2-L3
# 1 1 4 7 5 8 11 -3 -6 -3
# 2 2 5 8 7 10 13 -3 -6 -3
# 3 3 6 9 9 12 15 -3 -6 -3
dfrm <- data.frame( L1=1:3, L2=1:3, L3=3+1, `L1+L2`=NA,
`L2+L3`=NA, `L3+L1`=NA, `L1-L2`=NA,
check.names=FALSE)
dfrm
#------------
L1 L2 L3 L1+L2 L2+L3 L3+L1 L1-L2
1 1 1 4 NA NA NA NA
2 2 2 4 NA NA NA NA
3 3 3 4 NA NA NA NA
#-------------
dfrm[, 4:7] <- lapply(names(dfrm[, 4:7]),
function(nam) eval(parse(text=nam), envir=dfrm) )
dfrm
#-----------
L1 L2 L3 L1+L2 L2+L3 L3+L1 L1-L2
1 1 1 4 2 5 5 0
2 2 2 4 4 6 6 0
3 3 3 4 6 7 7 0
I chose to use eval(parse(text=...)) rather than with, since the use of with is specifically cautioned against in its help page. I'm not sure I can explain why the eval(..., target_dfrm) form should be any safer, though.

Resources