Comparing values in a row - R

I am trying to compare the values across each data frame row and remove all the rows where they match, using this:
dat[!dat[1]==dat[2]]
where
> dat
returns
n1 n2
n1 n4
n4 n5
n1 n3
n4 n4
So I want it to compare the values and delete the last row, since both columns have the same data. But when I use the above code, it tells me
Error in Ops.factor(left, right) : level sets of factors are different
The output of str(dat) reads:
'data.frame': 5 obs. of 2 variables:
$ V1: Factor w/ 2 levels "n1","n4": 1 1 2 1 2
$ V2: Factor w/ 4 levels "n2","n3","n4",..: 1 3 4 2 3

I suspect that in the creation of your data you inadvertently and implicitly converted your columns to factors. This most likely happened when you read the data from its source, e.g. when using read.csv or read.table. This example illustrates it:
dat <- read.table(text="
n1 n2
n1 n4
n4 n5
n1 n3
n4 n4")
str(dat)
'data.frame': 5 obs. of 2 variables:
$ V1: Factor w/ 2 levels "n1","n4": 1 1 2 1 2
$ V2: Factor w/ 4 levels "n2","n3","n4",..: 1 3 4 2 3
The remedy is to pass the argument stringsAsFactors=FALSE to read.table():
dat <- read.table(text="
n1 n2
n1 n4
n4 n5
n1 n3
n4 n4", stringsAsFactors=FALSE)
str(dat)
'data.frame': 5 obs. of 2 variables:
$ V1: chr "n1" "n1" "n4" "n1" ...
$ V2: chr "n2" "n4" "n5" "n3" ...
Then your code works (except that I suspect you've missed a comma):
dat[!dat[1]==dat[2], ]
V1 V2
1 n1 n2
2 n1 n4
3 n4 n5
4 n1 n3

One solution would be to instruct data.frame() not to convert character vectors into factors (using stringsAsFactors=FALSE):
x <- c('n1', 'n1', 'n4', 'n1', 'n4')
y <- c('n2', 'n4', 'n5', 'n3', 'n4')
df <- data.frame(x, y, stringsAsFactors=FALSE)
df <- df[-which(df$x == df$y), ]
After creating the data frame, the code removes the matching rows, producing the result you wanted; one caveat with the -which() pattern is noted below.
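As a hedged aside, the -which() pattern has an edge case: if no rows match, which() returns integer(0), and df[-integer(0), ] silently drops every row. A logical test sidesteps this:
# Safer variant: keeps all rows when nothing matches
df <- df[df$x != df$y, ]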

Related

How do I lookup the closest row in a lookup data.frame based on multiple columns?

Suppose I have the following dataframe with data (v) and a lookup dataframe (l):
v <- data.frame(d = c(as.Date('2019-01-01'), as.Date('2019-01-05'), as.Date('2019-01-30'), as.Date('2019-02-02')), kind=c('a', 'b', 'c', 'a'), v1=c(1,2,3,4))
v
d kind v1
1 2019-01-01 a 1
2 2019-01-05 b 2
3 2019-01-30 c 3
4 2019-02-02 a 4
l <- data.frame(d = c(as.Date('2019-01-01'), as.Date('2019-01-04'), as.Date('2019-02-01')), kind=c('a','b','a'), l1=c(10,20,30))
l
d kind l1
1 2019-01-01 a 10
2 2019-01-04 b 20
3 2019-02-01 a 30
I would like to find the closest row in the l data frame corresponding to each row in v, using the columns c("d", "kind"). Column kind needs to match exactly, and maybe findInterval(...) could be used on d?
I would like my result to be:
d kind v1 l1
1 2019-01-01 a 1 10
2 2019-01-05 b 2 20
3 2019-01-30 c 3 NA
4 2019-02-02 a 4 30
NOTE: I would prefer a base-R implementation, but it would be interesting to see others.
I tried findInterval(...) but I don't know how to get it to work with multiple columns.
Here's a shot in base R only. (I do believe that data.table will do this much more elegantly, but I appreciate your aversion to bringing in other packages.)
Split each frame into a list of frames, by kind:
v_spl <- split(v, v$kind)
l_spl <- split(l, l$kind)
str(v_spl)
# List of 3
# $ a:'data.frame': 2 obs. of 3 variables:
# ..$ d : Date[1:2], format: "2019-01-01" "2019-02-02"
# ..$ kind: Factor w/ 3 levels "a","b","c": 1 1
# ..$ v1 : num [1:2] 1 4
# $ b:'data.frame': 1 obs. of 3 variables:
# ..$ d : Date[1:1], format: "2019-01-05"
# ..$ kind: Factor w/ 3 levels "a","b","c": 2
# ..$ v1 : num 2
# $ c:'data.frame': 1 obs. of 3 variables:
# ..$ d : Date[1:1], format: "2019-01-30"
# ..$ kind: Factor w/ 3 levels "a","b","c": 3
# ..$ v1 : num 3
Now we determine the kind values the two frames have in common; there is no need to try to join everything:
### this has the 'kind' in common
(nms <- intersect(names(v_spl), names(l_spl)))
# [1] "a" "b"
### this has the 'kind' we have to bring back in later
(miss_nms <- setdiff(names(v_spl), nms))
# [1] "c"
For the in-common kind, do an interval join:
joined <- Map(
  v_spl[nms], l_spl[nms],
  f = function(v0, l0) {
    ind <- findInterval(v0$d, l0$d)
    ind[ind < 1] <- NA
    v0$l1 <- l0$l1[ind]
    v0
  })
Ultimately we will rbind things back together, but those in miss_nms will not have the new column(s). This is a generic way to capture exactly one row of the new columns with an appropriate NA value:
emptycols <- joined[[1]][, setdiff(colnames(joined[[1]]), colnames(v)), drop = FALSE][1, , drop = FALSE][NA, , drop = FALSE]
emptycols
# l1
# NA NA
And add that column(s) to the not-yet-found frames:
unjoined <- lapply(v_spl[miss_nms], cbind, emptycols)
unjoined
# $c
# d kind v1 l1
# 3 2019-01-30 c 3 NA
And finally bring everything back into a single frame:
do.call(rbind, c(joined, unjoined))
# d kind v1 l1
# a.1 2019-01-01 a 1 10
# a.4 2019-02-02 a 4 30
# b 2019-01-05 b 2 20
# c 2019-01-30 c 3 NA
If you wanted an exact match, you would use:
vl <- merge(v, l, by = c("d","kind"))
For your purposes, you can derive additional year, month, or day variables from d and then use merge; a rough sketch follows.
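As a hedged sketch of that idea, assuming a year/month granularity is acceptable (the derived column names yr and mo are made up for illustration):
# Derive year/month keys from d, then left-join on them plus kind
v2 <- transform(v, yr = format(d, "%Y"), mo = format(d, "%m"))
l2 <- transform(l, yr = format(d, "%Y"), mo = format(d, "%m"))
merge(v2, l2[, c("yr", "mo", "kind", "l1")], by = c("yr", "mo", "kind"), all.x = TRUE)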

Append to a data frame in a function - is globalenv really required?

I am using the following code, which works fine (improvement suggestions very much welcome):
WeeklySlopes <- function(Year, Week) {
  DynamicQuery <- paste('select DayOfYear, Week, Year, Close from SourceData where year =', Year,
                        'and week =', Week, 'order by DayOfYear')
  SubData <- sqldf(DynamicQuery)
  SubData$X <- as.numeric(rownames(SubData))
  lmfit <- lm(Close ~ X, data = SubData)
  lmfit <- tidy(lmfit)
  Slope <- as.numeric(sqldf("select estimate from lmfit where term = 'X'"))
  e <- globalenv()
  e$WeeklySlopesDf[nrow(e$WeeklySlopesDf) + 1, ] <- c(Year, Week, Slope)
}
WeeklySlopesDf <- data.frame(Year = integer(), Week = integer(), Slope = double())
WeeklySlopes(2017, 15)
WeeklySlopes(2017, 14)
head(WeeklySlopesDf)
Is there really no other way to append a row to my existing data frame? I seem to need to access the global environment. On the other hand, why can sqldf 'see' the 'global' data frame SourceData?
dfrm <- data.frame(a=1:10, b=letters[1:10]) # reproducible example
myfunc <- function(new_a=20){ g <- globalenv(); g$dfrm[3,1] <- new_a; cat(dfrm[3,1])}
myfunc()
20
dfrm
a b
1 1 a
2 2 b
3 20 c # so your strategy might work, although it's unconventional.
Now try to extend the data frame outside a function:
dfrm[11, ] <- c(a=20,b="c")
An occult disaster (conversion of numeric column to character):
str(dfrm)
'data.frame': 11 obs. of 2 variables:
$ a: chr "1" "2" "20" "4" ...
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
So use a list to avoid occult coercion:
dfrm <- data.frame(a=1:10, b=letters[1:10]) # start over
dfrm[11, ] <- list(a=20,b="c")
str(dfrm)
'data.frame': 11 obs. of 2 variables:
$ a: num 1 2 3 4 5 6 7 8 9 10 ...
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
Now try within a function:
myfunc <- function(new_a=20, new_b="ZZ"){ g <- globalenv(); g$dfrm[nrow(dfrm)+1, ] <- list(a=new_a,b=new_b)}
myfunc()
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = "ZZ") :
invalid factor level, NA generated
str(dfrm)
'data.frame': 12 obs. of 2 variables:
$ a: num 1 2 3 4 5 6 7 8 9 10 ...
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
So it succeeds, but if there are any factor columns, non-existent levels will be turned into NA values (with a warning). Your method of using named access to objects in the global environment is rather unconventional, but there is a set of tested methods that you might want to examine; look at ?R6. Other options are <<- and assign, which lets you specify the environment in which the assignment is to occur. A conventional alternative is sketched below.
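For completeness, a minimal sketch of that conventional alternative (the helper name append_row is made up for illustration): the function returns the grown frame and the caller reassigns it, so no environment is touched.
# Hypothetical helper: returns the modified frame instead of mutating globals
append_row <- function(df, Year, Week, Slope) {
  df[nrow(df) + 1, ] <- list(Year, Week, Slope)
  df
}
WeeklySlopesDf <- append_row(WeeklySlopesDf, 2017, 15, 0.5)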

Transforming a nested data frame with varying number of elements

I have a data frame with a column of nested data frames with 1 or 2 columns and n rows. It looks like df in the sample below:
'data.frame': 3 obs. of 2 variables:
$ vector:List of 3
..$ : chr "p1"
..$ : chr "p2"
..$ : chr "p3"
$ lists :List of 3
..$ :'data.frame': 2 obs. of 2 variables:
.. ..$ n1: Factor w/ 2 levels "a","b": 1 2
.. ..$ n2: Factor w/ 2 levels "1","2": 1 2
..$ :'data.frame': 1 obs. of 1 variable:
.. ..$ n1: Factor w/ 1 level "d": 1
..$ :'data.frame': 1 obs. of 2 variables:
.. ..$ n1: Factor w/ 1 level "e": 1
.. ..$ n2: Factor w/ 1 level "3": 1
df can be recreated like this:
v <- c("p1", "p2", "p3")
l <- list(data.frame(n1 = c("a", "b"), n2 = c("1", "2")), data.frame(n1 = "d"), data.frame(n1 = "e", n2 = "3"))
df <- as.data.frame(cbind(v, l))
I'd like to transform it to a data frame that looks like that:
[v] [n1] [n2]
p1 a 1
p1 b 2
p2 d NA
p3 e 3
- n1 and n2 are in separate columns
- if the data frame in row i has n rows, the vector element of row i should be repeated n times
- if there is no content in n1 or n2, there should be an NA
I've tried using tidyr::unnest but got the following error
unnest(df)
Error: All nested columns must have the same number of elements.
Does anyone have a better idea of how to transform the data frame into the desired format?
Using purrr::pmap_df, we combine v and l into a data frame within each row of df, and then bind all of those per-row data frames into one.
library(tidyverse)
pmap_df(df, function(v, l) {
  data.frame(v, l)
})
v n1 n2
1 p1 a 1
2 p1 b 2
3 p2 d <NA>
4 p3 e 3
This will avoid by-row operations, which will be important if you have a lot of rows.
library(data.table)
rbindlist(df$l, fill = TRUE, idcol = 'row')[, v := df$v[row]][]
# row n1 n2 v
#1: 1 a 1 p1
#2: 1 b 2 p1
#3: 2 d NA p2
#4: 3 e 3 p3
A solution using dplyr and tidyr. suppressWarnings is not strictly required for correctness: because the nested data frames were created with factor columns, it merely silences the warning emitted when combining factors with different levels. (A way to avoid the warning entirely is sketched after the output below.)
library(dplyr)
library(tidyr)
df1 <- suppressWarnings(df %>%
  mutate(v = unlist(.$v)) %>%
  unnest())
df1
# v n1 n2
# 1 p1 a 1
# 2 p1 b 2
# 3 p2 d <NA>
# 4 p3 e 3
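As a side note, the warning disappears entirely if the nested data frames are built with character columns instead of factors; a minimal sketch, changing only the construction of l:
l <- list(data.frame(n1 = c("a", "b"), n2 = c("1", "2"), stringsAsFactors = FALSE),
          data.frame(n1 = "d", stringsAsFactors = FALSE),
          data.frame(n1 = "e", n2 = "3", stringsAsFactors = FALSE))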

Grouping by row - data.table type change

This is related to the question "Group by in data.table in R which only keep non NA values from columns".
Example:
I have
df <- data.frame(x = c('a', 'a', 'b', 'b' ), y = c(1,NA,2,NA), z = c(NA, 3, NA, 4))
df
x y z
1 a 1 NA
2 a NA 3
3 b 2 NA
4 b NA 4
and I want
df2 <- data.frame(x = c('a', 'b' ), y = c(1,2), z = c(3,4))
df2
x y z
1 a 1 3
2 b 2 4
I am having the same issue as in the question above. I tried the accepted answer and it worked, but it changed the type of the contents of my data frame. I need them to stay numeric for downstream analysis, and using as.numeric afterwards did not work. I also tried solving the initial question using dplyr's group_by, but it didn't work either, so I guess I am misunderstanding the function (I am still a beginner in R and data analysis in general!).
Sorry for the very basic question but I have been stuck trying to solve this for a while! Any suggestions are welcome.
Thanks!
We can do this with data.table
library(data.table)
dt1 <- setDT(df)[, lapply(.SD, function(x) x[!is.na(x)]), x]
str(dt1)
#Classes ‘data.table’ and 'data.frame': 2 obs. of 3 variables:
#$ x: Factor w/ 2 levels "a","b": 1 2
#$ y: num 1 2
#$ z: num 3 4
str(df)
#Classes ‘data.table’ and 'data.frame': 4 obs. of 3 variables:
#$ x: Factor w/ 2 levels "a","b": 1 1 2 2
#$ y: num 1 NA 2 NA
#$ z: num NA 3 NA 4
If needed, we can change 'dt1' back to a 'data.frame' with setDF:
setDF(dt1)
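Since the question mentions dplyr's group_by, here is a hedged sketch of that route as well; it assumes, as in the example, exactly one non-NA value per group and column:
library(dplyr)
df %>%
  group_by(x) %>%
  summarise(y = first(na.omit(y)), z = first(na.omit(z)))
# x     y     z
# a     1     3
# b     2     4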

Re-convert data types in R

I have a subset of data within a large dataset that does not conform to the original data types assigned when the data was read into R. How can I re-convert the data types for the subset of data, just as R would do if only that subset was read?
Example: imagine that there is one stack of data consisting of variables 1-4 (v1 to v4) and a different set of data starting with column names v5 to v8.
V1 V2 V3 V4
1 32 a 11 a
2 12 b 32 b
3 3 c 42 c
4 v5 v6 v7 v8
5 a 43 a 35
6 b 33 b 64
7 c 55 c 32
If I create a new df with v5-v8, how can I automatically "re-convert" the entire data to appropriate types? (Just as R would do if I were to re-read the data from a csv)
You could try type.convert
df1 <- df[1:3,]
str(df1)
# 'data.frame': 3 obs. of 4 variables:
# $ V1: chr "32" "12" "3"
# $ V2: chr "a" "b" "c"
# $ V3: chr "11" "32" "42"
# $ V4: chr "a" "b" "c"
df1[] <- lapply(df1, type.convert)
str(df1)
#'data.frame': 3 obs. of 4 variables:
#$ V1: int 32 12 3
#$ V2: Factor w/ 3 levels "a","b","c": 1 2 3
#$ V3: int 11 32 42
#$ V4: Factor w/ 3 levels "a","b","c": 1 2 3
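A hedged side note: on recent R versions (4.0 and later), type.convert() warns unless as.is is specified; passing as.is = TRUE keeps character columns as character rather than factor, in line with the newer stringsAsFactors = FALSE default:
df1[] <- lapply(df1, type.convert, as.is = TRUE)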
To subset the dataset, you could use grep (as @Richard Scriven mentioned in the comments):
indx <- grep('^v', df[,1])
df2 <- df[(indx+1):nrow(df),]
df2[] <- lapply(df2, type.convert)
Suppose your dataset has many instances where this occurs. Split the dataset based on a grouping index (indx1, created by grepl) after removing the header rows (indx), and do the type.convert within the list:
indx1 <- cumsum(grepl('^v', df[,1]))+1
lst <- lapply(split(df[-indx, ], indx1[-indx]), function(x) {
  x[] <- lapply(x, type.convert)
  x
})
Then, if you need to cbind the columns (assuming that the number of rows is the same for all the list elements):
dat <- do.call(cbind, lst)
