How to produce an if statement to compare rows in R

I need to compare two rows next to each other in a column in a dataframe, if the data in both those rows matches, then save the most recent row, e.g.
# Animals
# 1 dog
# 2 cat
# 3 cat
It should compare dog and cat, find that they differ, and save nothing, so rows 1 and 2 are not saved.
But when it moves on to compare cat and cat, it should recognise that they are the same and save rows 2 and 3. There are several other columns, but the Animals column is the only one I need to use to decide whether a row is saved. However, I want to keep the data from all columns in the saved rows.
I need to do this for lots of rows, iterating through a big data set (~68,000 rows).
I've tried to produce an if statement in which:
# results <- list()
# for (i in seq_len(nrow(data) - 1)) {
#   if (isTRUE(data$Animals[i + 1] == data$Animals[i])) {
#     output <- print(data$Animals[i + 1])
#     results[[i + 1]] <- output
#     output <- print(data$Animals[i])
#     results[[i]] <- output
#   }
# }
I then converted this results list into a dataframe for further manipulation. However, this method only gives me the animal name; I would prefer that the entire row be saved. I'm not sure how to achieve this. I've been trying to edit the statement but I can't get it working.
I'm new to R and learning, so please help in any way you can. I'd appreciate it :)

To "prove" that we're saving the "most recent row", I'll add a row-number column. The data:
dat <- structure(list(Animals = c("dog", "cat", "cat"), row = 1:3), row.names = c(NA, -3L), class = "data.frame")
dat
# Animals row
# 1 dog 1
# 2 cat 2
# 3 cat 3
base R
dat[c(with(dat, Animals[-nrow(dat)] != Animals[-1]), TRUE),,drop=FALSE]
# Animals row
# 1 dog 1
# 3 cat 3
dplyr
library(dplyr)
dat %>%
filter(Animals != lead(Animals, default = ''))
# Animals row
# 1 dog 1
# 2 cat 3
The only caution I have with this is that, if package loading is at all out of order, there exist both stats::filter and stats::lag, which behave completely differently from their dplyr namesakes. If you see odd results, try prepending dplyr:: to make sure it isn't a which-function-am-I-using problem.
dat %>%
dplyr::filter(Animals != dplyr::lead(Animals, default = ''))

We could use lead and filter
library(dplyr)
df %>%
mutate(helper = lead(animals)) %>%
filter(animals == helper) %>%
select(animals)
Output:
animals
<chr>
1 cat
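Read literally, the question asks for every column of both rows in each matching pair (rows 2 and 3 in the example), which neither snippet above returns as written. A minimal base R sketch of that reading follows. The extra "bird" row, the helper names same_as_next and same_as_prev, and the assumption that Animals contains no NA values are illustrative choices, not part of the original post.

```r
# Toy data: the question's three rows plus one extra row (an assumption)
dat <- data.frame(Animals = c("dog", "cat", "cat", "bird"), row = 1:4,
                  stringsAsFactors = FALSE)

n <- nrow(dat)
# TRUE where a row holds the same animal as the row directly below it
same_as_next <- c(dat$Animals[-n] == dat$Animals[-1], FALSE)
# TRUE where a row holds the same animal as the row directly above it
same_as_prev <- c(FALSE, same_as_next[-n])

# Keep every column of any row that belongs to a matching pair
kept <- dat[same_as_next | same_as_prev, , drop = FALSE]
print(kept)
```

Because the comparison is vectorised, no explicit loop or if statement is needed, which should help at around 68,000 rows.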

Related

How to search for words with asterisks and wildcards (e.g., exampl*) in R (word appearance in a data frame)

I wrote a code to count the appearance of words in a data frame:
Items <- c('decid*','head', 'heads')
df1<-data.frame(Items)
words<- c('head', 'heads', 'decided', 'decides', 'top', 'undecided')
df_main<-data.frame(words)
item <- vector()
count <- vector()
for (i in 1:length(unique(Items))){
item[i] <- Items[i]
count[i]<- sum(df_main$words == item[i])}
word_freq <- data.frame(cbind(item, count))
word_freq
However, the results are like this:
    item count
1 decid*     0
2   head     1
3  heads     1
As you see, it does not correctly count for "decid*". The actual results I expect should be like this:
    item count
1 decid*     2
2   head     1
3  heads     1
I think I need to change the item word (decid*) format, however, I could not figure it out. Any help is much appreciated!
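I think you want to use decid* as a pattern. == looks for an exact match; you may use grepl to look for a particular pattern. Note that decid* is a wildcard (glob) rather than a regular expression: used directly as a regex it would also match "undecided", so glob2rx converts it first.
I have used sapply as an alternative to the for loop.
result <- stack(sapply(unique(df1$Items), function(x) {
  if (grepl('*', x, fixed = TRUE)) sum(grepl(glob2rx(x), df_main$words))
  else sum(x == df_main$words)
}))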
result
# values ind
#1 2 decid*
#2 1 head
#3 1 heads
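To see why the wildcard form matters, the following sketch compares passing decid* straight to grepl with translating it through glob2rx() from the utils package first. As a regular expression, decid* means "deci" followed by zero or more "d" anywhere in the string; the variable names raw_hits and glob_hits are only for illustration.

```r
words <- c("head", "heads", "decided", "decides", "top", "undecided")

# Used directly as a regex: matches anywhere in the word
raw_hits <- grepl("decid*", words)

# Translated from a glob: the pattern is anchored at the start of the word
glob_hits <- grepl(glob2rx("decid*"), words)

print(words[raw_hits])
print(words[glob_hits])
```

With the sample words, the untranslated pattern also picks up "undecided", while the translated one keeps only the words that begin with "decid".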
Using tidyverse
library(dplyr)
library(stringr)
df1 %>%
rowwise %>%
mutate(count =sum(str_detect(df_main$words,
str_c("\\b", str_replace(Items, fixed("*"), ".*" ), "\\b")))) %>%
ungroup
Output:
# A tibble: 3 × 2
Items count
<chr> <int>
1 decid* 2
2 head 1
3 heads 1
Perhaps as an alternative approach altogether: instead of creating a new dataframe word_freq, why not create a new column in df_main (if that's your "main" dataframe) that indicates whether each word matches any of your (apparently key) Items? That column will not actually contain counts, because each entry of the input column words holds a single word. So the question is not how many matches there are in each row but whether there is a match in the first place. That can be indicated by grepl in base R or str_detect in stringr.
EDIT:
Given the newly posted input data
Items <- c('decid*','head', 'heads')
df1<-data.frame(Items)
words<- c('head', 'heads', 'decided', 'decides', 'top', 'undecided')
df_main<-data.frame(words)
and the OP's wish to have the matches in df_main, the solution might be this:
library(stringr)
df_main$Items_match <- +str_detect(df_main$words, str_c(Items, collapse = "|"))
Result:
df_main
words Items_match
1 head 1
2 heads 1
3 decided 1
4 decides 1
5 top 0
6 undecided 1

Which() for the whole dataset

I want to write a function in R that does the following:
I have a table of cases, and some data. I want to find the correct row matching to each observation from the data. Example:
crit1 <- c(1,1,2)
crit2 <- c("yes","no","no")
Cases <- matrix(c(crit1,crit2),ncol=2,byrow=FALSE)
data1 <- c(1,2,1)
data2 <- c("no","no","yes")
data <- matrix(c(data1,data2),ncol=2,byrow=FALSE)
Now I want a function that returns for each row of my data, the matching row from Cases, the result would be the vector
c(2,3,1)
Are you sure you want to be using matrices for this?
Note that the numeric data in crit1 and data1 has been converted to string (matrices can only store one data type):
typeof(data[ , 1L])
# [1] character
In R, a data.frame is a much more natural choice for what you're after. data.table is (among many other things) a toolset for working with "enhanced" data.frames; See the Introduction.
I would create your data as:
library(data.table)
Cases = data.table(crit1, crit2)
data = data.table(data1, data2)
We can get the matching row indices as asked by doing a keyed join (See the vignette on keys):
setkey(Cases) # key by all columns
Cases
# crit1 crit2
# 1: 1 no
# 2: 1 yes
# 3: 2 no
setkey(data)
data
# data1 data2
# 1: 1 no
# 2: 1 yes
# 3: 2 no
Cases[data, which=TRUE]
# [1] 1 2 3
This differs from 2,3,1 because the order of your data has changed, but note that the answer is still correct.
If you don't want to change the order of your data, it's slightly more complicated (but more readable if you're not used to data.table syntax):
Cases = data.table(crit1, crit2)
data = data.table(data1, data2)
Cases[data, on = setNames(names(data), names(Cases)), which=TRUE]
# [1] 2 3 1
The on= part creates the mapping between the columns of data and those of Cases.
We could write this in a bit more SQL-like fashion as:
Cases[data, on = .(crit1 == data1, crit2 == data2), which=TRUE]
# [1] 2 3 1
This is shorter and more readable for your sample data, but not as extensible if your data has many columns or if you don't know the column names in advance.
The prodlim package has a function for that:
library(prodlim)
row.match(data,Cases)
[1] 2 3 1
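If adding a package is not an option, a base R sketch of the same lookup is to paste each row into a single key and use match(). This assumes the separator "|" never occurs in the values themselves; case_key and data_key are hypothetical helper names.

```r
Cases <- data.frame(crit1 = c(1, 1, 2), crit2 = c("yes", "no", "no"))
data  <- data.frame(data1 = c(1, 2, 1), data2 = c("no", "no", "yes"))

# Collapse each row into one string key (assumes "|" is not in the data)
case_key <- paste(Cases$crit1, Cases$crit2, sep = "|")
data_key <- paste(data$data1, data$data2, sep = "|")

# For each row of data, the position of the first identical row in Cases
idx <- match(data_key, case_key)
print(idx)
```

Rows of data with no counterpart in Cases come back as NA, so they are easy to spot afterwards.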

Efficiently transform XML to data frame

I need to transform some vanilla xml into a data frame. The XML is a simple representation of rectangular data (see example below). I can achieve this pretty straightforwardly in R with xml2 and a couple of for loops. However, I'm sure there is a much better/faster way (purrr?). The XML I will be ultimately working with are very large, so more efficient methods are preferred. I would be grateful for any advice from the community.
library(tidyverse)
library(xml2)
demo_xml <-
"<DEMO>
<EPISODE>
<item1>A</item1>
<item2>1</item2>
</EPISODE>
<EPISODE>
<item1>B</item1>
<item2>2</item2>
</EPISODE>
</DEMO>"
dx <- read_xml(demo_xml)
episodes <- xml_find_all(dx, xpath = "//EPISODE")
dx_names <- xml_name(xml_children(episodes[1]))
df <- data.frame()
for(i in seq_along(episodes)) {
for(j in seq_along(dx_names)) {
df[i, j] <- xml_text(xml_find_all(episodes[i], xpath = dx_names[j]))
}
}
names(df) <- dx_names
df
#> item1 item2
#> 1 A 1
#> 2 B 2
Created on 2019-09-19 by the reprex package (v0.3.0)
Thank you in advance.
This is a general solution which handles a varying number of different sub-nodes for each parent node. Each Episode node may have different sub-nodes.
This strategy parses the children nodes identifying the name and values of each sub node. Then it converts this list into a longer style dataframe and then reshapes it into your desired wider style:
library(tidyr)
library(xml2)
demo_xml <-
"<DEMO>
<EPISODE>
<item1>A</item1>
<item2>1</item2>
</EPISODE>
<EPISODE>
<item1>B</item1>
<item2>2</item2>
</EPISODE>
</DEMO>"
dx <- read_xml(demo_xml)
#find all episodes
episodes <- xml_find_all(dx, xpath = "//EPISODE")
#extract the node names and values from all of the episodes
nodenames<-xml_name(xml_children(episodes))
contents<-trimws(xml_text(xml_children(episodes)))
#Identify the number of subnodes under each episode for labeling
IDlist<-rep(seq_along(episodes), xml_length(episodes))
#make a long dataframe
df<-data.frame(episodes=IDlist, nodenames, contents, stringsAsFactors = FALSE)
#make the dataframe wide, Remove unused blank nodes:
answer <- spread(df[df$contents!="",], nodenames, contents)
#tidyr 1.0.0 version
#answer <- pivot_wider(df, names_from = nodenames, values_from = contents)
# A tibble: 2 x 3
episodes item1 item2
<int> <chr> <chr>
1 1 A 1
2 2 B 2
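The long-to-wide step does not depend on tidyr. As a rough sketch, base R's reshape() can do the same job; the long table below is typed in by hand to mirror the df built above rather than parsed from XML, and reshape() prefixes the new columns with "contents.".

```r
# Hand-built stand-in for the long data frame (an assumption, not parsed XML)
long <- data.frame(
  episodes  = c(1L, 1L, 2L, 2L),
  nodenames = c("item1", "item2", "item1", "item2"),
  contents  = c("A", "1", "B", "2"),
  stringsAsFactors = FALSE
)

# One row per episode, one column per distinct node name
wide <- reshape(long, idvar = "episodes", timevar = "nodenames",
                direction = "wide")
print(wide)
```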
This may be an option without using a for loop:
episodes <- xml_find_all(dx, xpath = "//EPISODE")
dx_names <- xml_name(xml_children(episodes[1]))
# You can get all values between the tags by xml_text()
values <- xml_children(episodes) %>% xml_text()
as.data.frame(matrix(values,
ncol=length(dx_names),
dimnames =list(seq_along(episodes),dx_names),byrow=TRUE))
gives,
item1 item2
1 A 1
2 B 2
Note that you may need to convert the item2 column to numeric with as.numeric(), since this solution stores it as text: a factor in R versions before 4.0.0 (convert with as.character() first), a character vector from 4.0.0 onwards.
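One pitfall with that conversion is worth a small illustration: calling as.numeric() directly on a factor returns the internal level codes, not the numbers printed on screen. The toy factor below is made up for the example.

```r
f <- factor(c("10", "20", "10"))

# Level codes, not the values the labels spell out
codes <- as.numeric(f)

# Go through character first to recover the actual numbers
values <- as.numeric(as.character(f))

print(codes)
print(values)
```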

What's the best way to add a specific string to all column names in a dataframe in R?

I am trying to train a model on data that has been converted from a document-term matrix to a dataframe. There are separate fields for the positive and negative comments, so I want to add a string to the column names to serve as a "tag" that differentiates the same word coming from the different fields. For example, the word hello can appear in both the positive and negative comment fields (and is thus represented as a column in my dataframe), so in my model I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
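Applied to the two cases in the question, the same idea might look like the sketch below. The small pos and neg data frames are invented stand-ins for the two document-term matrices; only mtcars comes from the question.

```r
# Suffix every column of mtcars, as in the question
cars <- mtcars
names(cars) <- paste0(names(cars), "_sample")
print(head(names(cars), 3))

# Hypothetical term counts from the positive and negative comment fields
pos <- data.frame(hello = c(1, 0), great = c(0, 2))
neg <- data.frame(hello = c(0, 1), awful = c(3, 0))

# Tag each set of columns before combining so "hello" stays distinguishable
names(pos) <- paste0("positive_", names(pos))
names(neg) <- paste0("negative_", names(neg))
combined <- cbind(pos, neg)
print(names(combined))
```

paste0() is vectorised over the names, so neither sapply nor lapply is needed here.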
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package; it renames by reference, so it doesn't create a copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4

Remove an entire column from a data.frame in R

Does anyone know how to remove an entire column from a data.frame in R? For example if I am given this data.frame:
> head(data)
chr genome region
1 chr1 hg19_refGene CDS
2 chr1 hg19_refGene exon
3 chr1 hg19_refGene CDS
4 chr1 hg19_refGene exon
5 chr1 hg19_refGene CDS
6 chr1 hg19_refGene exon
and I want to remove the 2nd column.
You can set it to NULL.
> Data$genome <- NULL
> head(Data)
chr region
1 chr1 CDS
2 chr1 exon
3 chr1 CDS
4 chr1 exon
5 chr1 CDS
6 chr1 exon
As pointed out in the comments, here are some other possibilities:
Data[2] <- NULL # Wojciech Sobala
Data[[2]] <- NULL # same as above
Data <- Data[,-2] # Ian Fellows
Data <- Data[-2] # same as above
You can remove multiple columns via:
Data[1:2] <- list(NULL) # Marek
Data[1:2] <- NULL # does not work!
Be careful with matrix-subsetting though, as you can end up with a vector:
Data <- Data[,-(2:3)] # vector
Data <- Data[,-(2:3),drop=FALSE] # still a data.frame
To remove one or more columns by name, when the column names are known (as opposed to being determined at run-time), I like the subset() syntax. E.g. for the data-frame
Data <- data.frame(a=1:3, d=2:4, c=3:5, b=4:6)
to remove just the a column you could do
Data <- subset( Data, select = -a )
and to remove the b and d columns you could do
Data <- subset( Data, select = -c(d, b ) )
You can remove all columns between d and b with:
Data <- subset( Data, select = -c( d : b ) )
As I said above, this syntax works only when the column names are known. It won't work when say the column names are determined programmatically (i.e. assigned to a variable). I'll reproduce this Warning from the ?subset documentation:
Warning:
This is a convenience function intended for use interactively.
For programming it is better to use the standard subsetting
functions like '[', and in particular the non-standard evaluation
of argument 'subset' can have unanticipated consequences.
(For completeness) If you want to remove columns by name, you can do this:
cols.dont.want <- "genome"
cols.dont.want <- c("genome", "region") # if you want to remove multiple columns
data <- data[, ! names(data) %in% cols.dont.want, drop = FALSE]
Including drop = FALSE ensures that the result will still be a data.frame even if only one column remains.
The posted answers are very good when working with data.frames. However, these tasks can be pretty inefficient from a memory perspective. With large data, removing a column can take an unusually long amount of time and/or fail due to out of memory errors. Package data.table helps address this problem with the := operator:
library(data.table)
> dt <- data.table(a = 1, b = 1, c = 1)
> dt[,a:=NULL]
b c
[1,] 1 1
I should put together a bigger example to show the differences. I'll update this answer at some point with that.
There are several options for removing one or more columns with dplyr::select() and some helper functions. The helper functions can be useful because some do not require naming all the specific columns to be dropped. Note that to drop columns using select() you need to use a leading - to negate the column names.
Using the dplyr::starwars sample data for some variety in column names:
library(dplyr)
starwars %>%
select(-height) %>% # a specific column name
select(-one_of('mass', 'films')) %>% # any columns named in one_of()
select(-(name:hair_color)) %>% # the range of columns from 'name' to 'hair_color'
select(-contains('color')) %>% # any column name that contains 'color'
select(-starts_with('bi')) %>% # any column name that starts with 'bi'
select(-ends_with('er')) %>% # any column name that ends with 'er'
select(-matches('^v.+s$')) %>% # any column name matching the regex pattern
select_if(~!is.list(.)) %>% # not by column name but by data type
head(2)
# A tibble: 2 x 2
homeworld species
<chr> <chr>
1 Tatooine Human
2 Tatooine Droid
You can also drop by column number:
starwars %>%
select(-2, -(4:10)) # column 2 and columns 4 through 10
With this you can remove the column and store the result in another variable.
df = subset(data, select = -c(genome) )
Using dplyr, the following works:
data <- select(data, -genome)
as per documentation found here https://www.marsja.se/how-to-remove-a-column-in-r-using-dplyr-by-name-and-index/#:~:text=select(starwars%2C%20%2Dheight)
I just thought I'd add one in that wasn't mentioned yet. It's simple but also interesting because in all my perusing of the internet I did not see it, even though the highly related %in% appears in many places.
df <- df[ , -which(names(df) == 'removeCol')]
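One caveat with the -which() form deserves a quick check: when no column has the given name, which() returns an empty vector, and negating an empty index selects nothing rather than everything. The sketch below uses a made-up two-column data frame and a deliberately absent name "missing".

```r
df <- data.frame(a = 1:2, b = 3:4)

# No column is called "missing", so which() is empty and ALL columns vanish
bad <- df[, -which(names(df) == "missing"), drop = FALSE]

# A logical index has no such edge case: every column is kept
good <- df[, names(df) != "missing", drop = FALSE]

print(ncol(bad))
print(names(good))
```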
Also, I didn't see anyone post grep alternatives. These can be very handy for removing multiple columns that match a pattern.
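As one possible illustration of the grep-style approach, the sketch below drops every column whose name starts with "score_". The data frame and the pattern are invented for the example.

```r
df <- data.frame(id = 1:2, score_a = c(0.1, 0.2), score_b = c(0.3, 0.4),
                 note = c("x", "y"), stringsAsFactors = FALSE)

# TRUE for the columns to keep: names that do NOT match the pattern
keep <- !grepl("^score_", names(df))

trimmed <- df[, keep, drop = FALSE]
print(names(trimmed))
```

Using grepl() rather than negative indices from grep() keeps all columns when nothing matches the pattern.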
