R: Find identical strings in two lists

I have two lists of strings and want to find which strings are in both lists.
I tried converting the lists to vectors so that I could use intersect or setequal, but that converted all the strings to numbers. Apologies if there's an obvious answer I can't figure out, but I can't seem to convert the lists without that happening.
What's the best way forward?
EDIT:
I have these data frames:
dput(s)
structure(list(V1 = structure(c(3L, 2L, 1L, 4L), .Label = c("24d2afb212410711de0e237e5435e104",
"2a3d9ca791a579a14883de538a012e24", "a90b03209a8095ec406809d89d5035c3",
"f271eb38cc409c6bfe9dcf2bfcab8471"), class = "factor")), .Names = "V1", row.names = c(NA,
-4L), class = "data.frame")
dput(r)
structure(list(V1 = structure(c(2L, 1L, 4L, 3L), .Label = c("24d2afb212410711de0e237e5435e104",
"2a3d9ca791a579a14883de538a012e24", "7320e2e921df862968954d4b60e2a80a",
"a9f47ec7c488d2bcddf2c1adc2bf6305"), class = "factor")), .Names = "V1", row.names = c(NA,
-4L), class = "data.frame")
I want to find the strings that are in both, i.e.
2a3d9ca791a579a14883de538a012e24 and 24d2afb212410711de0e237e5435e104.
as.character() doesn't work for preserving those strings; is there something else that would work for converting them into factors or is there another operation that would work better?

You need to specify the columns too within your data frames.
With intersect,
intersect(r$V1, s$V1)
#[1] "2a3d9ca791a579a14883de538a012e24" "24d2afb212410711de0e237e5435e104"
With grep,
unlist(sapply(r$V1, function(i)grep(i, s$V1, value = TRUE)))
#[1] "2a3d9ca791a579a14883de538a012e24" "24d2afb212410711de0e237e5435e104"
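The "strings turned into numbers" symptom comes from the columns being factors: converting a factor directly to a number returns its internal integer codes. Below is a minimal sketch, rebuilding the two data frames from the dput output above, that converts with as.character() first and then matches exactly. The names common and in_both are only illustrative.

```r
# Rebuild the question's data frames; V1 is a factor in both
s <- data.frame(V1 = factor(c("a90b03209a8095ec406809d89d5035c3",
                              "2a3d9ca791a579a14883de538a012e24",
                              "24d2afb212410711de0e237e5435e104",
                              "f271eb38cc409c6bfe9dcf2bfcab8471")))
r <- data.frame(V1 = factor(c("2a3d9ca791a579a14883de538a012e24",
                              "24d2afb212410711de0e237e5435e104",
                              "a9f47ec7c488d2bcddf2c1adc2bf6305",
                              "7320e2e921df862968954d4b60e2a80a")))

# Converting a factor to a number yields its integer codes, not the labels
codes <- as.integer(r$V1)

# Convert the column (not the whole data frame) to character first
r_chr <- as.character(r$V1)
s_chr <- as.character(s$V1)

# Exact matching, either as a set operation or as a logical filter
common  <- intersect(r_chr, s_chr)
in_both <- r_chr[r_chr %in% s_chr]
print(common)
```

Note that grep treats each string as a regular expression and also matches substrings, so for identifiers like these, intersect or %in% is the safer choice when only exact matches are wanted.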

Related

Formatting issues when removing row numbers in datatable

I am using the R package DT to create a table. This table contains hyperlinks and the issue that I am having is that when I put rownames = FALSE to remove the row numbers, the formatting on the hyperlinks goes away. I was wondering if anyone had a solution to this problem?
Example data:
structure(list(school = structure(c(2L, 3L, 1L, 4L), .Label = c("Linfield",
"OSU", "UO", "Willamette"), class = "factor"), mascot = structure(c(2L,
3L, 4L, 1L), .Label = c("bearcats", "beavers", "ducks", "wildcats"
), class = "factor"), website = structure(c(1L, 3L, 2L, 4L), .Label = c("oregonstate.edu",
"linfield.edu", "uoregon.edu",
"willamette.edu"), class = "factor"),
School_colors = structure(c(2L, 1L, 3L, 4L), .Label = c("<span style=\"color:green\">green & yellow</span>",
"<span style=\"color:orange\">orange & black</span>", "<span style=\"color:purple\">purple and red</span>",
"<span style=\"color:red\">red and yellow</span>"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
Code used to generate table WITH row names
datatable(df,escape = c(1,2,3))
Code used to generate table WITHOUT row names
datatable(df, rownames = FALSE,escape = c(1,2,3))
As you can see, with the second example code, the formatting in the third column is no longer there. What I want to do is create a table without row numbers but also keep the formatting of the hyperlinks.
Since you removed the row names, the indexes of your columns changed as well: the row names no longer occupy the first displayed column.
Therefore you should change your escape argument to cover only columns 1 and 2.
datatable(df, rownames = FALSE,escape = c(1,2))
A simple escape = FALSE also works for you.
datatable(df, rownames = FALSE, escape = FALSE)

Simple text cleaning across all columns of a data frame

I have a dataframe to which I would like to apply some basic formatting rules.
The dataframe is:
df <- structure(list(colname1 = structure(c(2L, 1L, 1L), .Label = c("",
"TEXTA"), class = "factor"), colname2 = structure(c(2L, 1L, 3L
), .Label = c("TEXTA", "TEXTB", "TEXTE"), class = "factor"),
colname3 = structure(c(2L, 3L, 1L), .Label = c("", "TEXTC",
"TEXTD"), class = "factor")), .Names = c("colname1", "colname2",
"colname3"), class = "data.frame", row.names = c(NA, -3L))
I try to run the following for the whole dataframe data:
df2 <- as.data.frame(tolower(df))
df2 <- as.data.frame(gsub("[[:punct:]]", "", df2))
but this converts the column names of the dataframe to rows. How can I convert all values of the example dataframe to lower case and remove punctuation (I am not interested in the column names)?
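We remove the punctuation characters from each column by looping through the columns with lapply(df, ...) and assigning the output back to the original dataset. Using df[] on the left-hand side keeps the data frame structure and the column names.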
df[] <- lapply(df, function(x) gsub("[[:punct:]]+", "", tolower(x)))
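Using tidyverse, this can be done with across(), available from dplyr 1.0.0 (funs() has been deprecated since dplyr 0.8.0 and mutate_all() is superseded):
library(dplyr)
df %>%
  mutate(across(everything(), ~ gsub("[[:punct:]]+", "", tolower(.x))))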
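The sample data in the question contains no punctuation, so the effect of the base R approach is easier to see on a small made-up example. The data frame df_demo and its contents below are hypothetical and only serve as an illustration.

```r
# Hypothetical data (not from the question) with mixed case and punctuation
df_demo <- data.frame(
  colname1 = c("TEXT-A!", "", "Hello, World"),
  colname2 = c("TEXTB", "text.a", "TEXT_E?"),
  stringsAsFactors = TRUE
)

# Lower-case every column, then strip punctuation; df_demo[] keeps the
# data frame shape and the column names untouched
df_demo[] <- lapply(df_demo, function(x) gsub("[[:punct:]]+", "", tolower(x)))

print(df_demo)
str(df_demo)  # the columns are now character, no longer factor
```

The cleaned columns come back as character vectors rather than factors, because tolower() and gsub() both return character.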

summary and descriptive table for mixed data in R

I want to make a function that calculates some pre-determined summary statistic measures that I can apply to any dataset. I'll start off with an example here, but this is for datasets that could have a variety of datatypes - such as character, factor, numerical, dates, containing null values, etc.
I can do this easily enough if the data is all numeric, but handling the IF scenarios with apply, sapply, etc. is where I run into trouble with the syntax.
When it's all numeric I'm fine, since I can just do new_df = data.frame(min = sapply(mydf, min), ...etc.). I just can't get the syntax right when it's more complicated, like in my example below.
In the example below I have a data frame of 3 columns:
all numerical
numerical with a null
categorical column of data coded as a factor
I want to calculate the:
type...(character, factor, date, numeric, etc)
mean...when the data-type is numeric obviously , and excluding nulls
number of null values in the dataset
I think this is simple enough and I can run with it from here..
copy and paste this code and name as a variable for the data frame:
structure(list(allnumeric = c(10, 20, 30, 40), char_or_factor = structure(c(2L,
3L, 3L, 1L), .Label = c("bird", "cat", "dog"), class = "factor"),
num_with_null = c(10, 100, NA, NA)), .Names = c("allnumeric",
"char_or_factor", "num_with_null"), row.names = c(NA, -4L), class = "data.frame")
expected solution data frame (copy and assign to a variable):
structure(list(allnumeric = structure(c(3L, 2L, 1L), .Label = c("0",
"25", "numeric"), class = "factor"), char_or_factor = structure(c(2L,
NA, 1L), .Label = c("0", "character"), class = "factor"), num_with_null = structure(c(3L,
2L, 1L), .Label = c("2", "55", "numeric"), class = "factor")), .Names = c("allnumeric",
"char_or_factor", "num_with_null"), row.names = c("type", "mean",
"num_nulls"), class = "data.frame")
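We can use sapply to loop over the columns, get the class, the mean and the number of NA elements, concatenate them with c(), and convert the result to a data.frame. Because c() coerces everything to a common type, all values come back as character strings, and mean() returns NA with a warning for the non-numeric column.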
as.data.frame(sapply(df1, function(x) c(class(x), mean(x, na.rm=TRUE),
sum(is.na(x)))), stringsAsFactors=FALSE)
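If keeping proper column types matters, one possible variant is to build the summary with one row per input column, so that the mean stays numeric and the NA count stays an integer. This is only a sketch: the function name summarise_columns is hypothetical, and the layout is transposed compared with the expected output in the question.

```r
# Same data as in the question, rebuilt by hand
df1 <- data.frame(
  allnumeric     = c(10, 20, 30, 40),
  char_or_factor = factor(c("cat", "dog", "dog", "bird")),
  num_with_null  = c(10, 100, NA, NA)
)

# Hypothetical helper: one summary row per column of `df`
summarise_columns <- function(df) {
  data.frame(
    # first class only, so multi-class columns (e.g. POSIXct) still give one value
    type = vapply(df, function(x) class(x)[1], character(1)),
    # mean only for numeric columns, NA otherwise (avoids the warning)
    mean = vapply(df, function(x) {
      if (is.numeric(x)) mean(x, na.rm = TRUE) else NA_real_
    }, numeric(1)),
    num_nulls = vapply(df, function(x) sum(is.na(x)), integer(1)),
    row.names = names(df),
    stringsAsFactors = FALSE
  )
}

res <- summarise_columns(df1)
print(res)
```

Calling t(res) gives the orientation from the question, at the cost of coercing every value to character again.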

Collapse and aggregate several row values by date

I've got a data set that looks like this:
date, location, value, tally, score
2016-06-30T09:30Z, home, foo, 1,
2016-06-30T12:30Z, work, foo, 2,
2016-06-30T19:30Z, home, bar, , 5
I need to aggregate these rows together, to obtain a result such as:
date, location, value, tally, score
2016-06-30, [home, work], [foo, bar], 3, 5
There are several challenges for me:
The resulting row (a daily aggregate) must include the rows for this day (2016-06-30 in my above example)
Some rows (strings) will result in an array containing all the values present on this day
Some others (ints) will result in a sum
I've had a look at dplyr, and if possible I'd like to do this in R.
Thanks for your help!
Edit:
Here's a dput of the data
structure(list(date = structure(1:3, .Label = c("2016-06-30T09:30Z",
"2016-06-30T12:30Z", "2016-06-30T19:30Z"), class = "factor"),
location = structure(c(1L, 2L, 1L), .Label = c("home", "work"
), class = "factor"), value = structure(c(2L, 2L, 1L), .Label = c("bar",
"foo"), class = "factor"), tally = c(1L, 2L, NA), score = c(NA,
NA, 5L)), .Names = c("date", "location", "value", "tally",
"score"), class = "data.frame", row.names = c(NA, -3L))
mydat<-structure(list(date = structure(1:3, .Label = c("2016-06-30T09:30Z",
"2016-06-30T12:30Z", "2016-06-30T19:30Z"), class = "factor"),
location = structure(c(1L, 2L, 1L), .Label = c("home", "work"
), class = "factor"), value = structure(c(2L, 2L, 1L), .Label = c("bar",
"foo"), class = "factor"), tally = c(1L, 2L, NA), score = c(NA,
NA, 5L)), .Names = c("date", "location", "value", "tally",
"score"), class = "data.frame", row.names = c(NA, -3L))
mydat$date <- as.Date(mydat$date)
require(data.table)
mydat.dt <- data.table(mydat)
# paste only the string columns; tally and score are summed separately below
mydat.dt <- mydat.dt[, lapply(.SD, paste0, collapse=" "), by = date, .SDcols = c("location", "value")]
cbind(mydat.dt, aggregate(mydat[,c("tally", "score")], by=list(mydat$date), FUN = sum, na.rm=TRUE)[2:3])
which gives you:
         date       location       value tally score
1: 2016-06-30 home work home foo foo bar     3     5
Note that if you wanted to you could probably do it all in one step in the reshaping of the data.table but I found this to be a quicker and easier way for me to achieve the same thing in 2 steps.
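A one-step version might look like the sketch below, assuming the data.table package is installed. It rebuilds the question's rows by hand, uses unique() so each string appears once per day (closer to the [home, work] result asked for), and the names dt and daily are illustrative only.

```r
library(data.table)

# Same rows as the question's data, rebuilt by hand
mydat <- data.frame(
  date     = c("2016-06-30T09:30Z", "2016-06-30T12:30Z", "2016-06-30T19:30Z"),
  location = c("home", "work", "home"),
  value    = c("foo", "foo", "bar"),
  tally    = c(1L, 2L, NA),
  score    = c(NA, NA, 5L),
  stringsAsFactors = FALSE
)

dt <- as.data.table(mydat)
# Keep only the calendar day (first 10 characters of the timestamp)
dt[, date := as.Date(substr(date, 1, 10))]

# Collapse strings and sum integers in a single grouped call
daily <- dt[, .(
  location = paste(unique(location), collapse = ", "),
  value    = paste(unique(value), collapse = ", "),
  tally    = sum(tally, na.rm = TRUE),
  score    = sum(score, na.rm = TRUE)
), by = date]

print(daily)
```

If a true array per day is needed rather than a pasted string, wrapping the values in list(unique(location)) inside the same call would give a list column instead.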

as.numeric is rounding off values

I am trying to convert a character column from a data frame to numeric. However, what I am getting as a result are rounded values.
Whatever I have tried by researching other questions of the same nature on SO, hasn't worked for me. I have checked the class of the column vector I am trying to convert, and it is a character, not a factor.
Here is my code snippet:
data1 <- read.csv("file.csv", nrows = 100, colClasses = c("factor", "factor", "character", "character"))
y <- Vectorize(function(x) gsub("[^\\.\\d]", "", x, perl = TRUE))
data1$colC <- y(data1$colC)
data1$colD <- y(data1$colD)
data1$colC <- as.numeric(data1$colC)
data1$colD <- as.numeric(data1$colD)
Edit:
> dput(head(data1))
structure(list(colA = structure(c(2L, 2L, 5L, 6L, 5L, 6L), .Label = c("(Other)",
"Direct", "Display", "Email", "Organic Search", "Paid Search",
"Referral", "Social Network"), class = "factor"), colB = structure(c(1L,
2L, 2L, 2L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
colC = c("4023107.87", "3180863.42", "2558777.81", "2393736.25",
"1333148.48", "1275627.13"), colD = c("49731596.35", "33604210.26",
"20807573.12", "20061467.30", "10488358.77", "10442249.09"
)), .Names = c("colA", "colB", "colC", "colD"), row.names = c(NA,
6L), class = "data.frame")
I think this is a representation problem, not an actual rounding problem ...
options("digits") ## 7
From ?options:
‘digits’: controls the number of digits to print when printing numeric values. It is a suggestion only. Valid values are 1...22 with default 7. See the note in ‘print.default’ about values greater than 15.
digits can be reset either on a one-off basis, i.e. print(object,digits=...), or globally, i.e. options(digits=20) (20 is probably overkill but helps you see what's happening: based on the results below, 10 might serve your needs well.)
as.numeric(data1$colC)
[1] 4023108 3180863 2558778 2393736 1333148 1275627
print(as.numeric(data1$colC),digits=10)
[1] 4023107.87 3180863.42 2558777.81 2393736.25 1333148.48 1275627.13
print(as.numeric(data1$colC),digits=20)
[1] 4023107.8700000001118 3180863.4199999999255 2558777.8100000000559
[4] 2393736.2500000000000 1333148.4799999999814 1275627.1299999998882
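To check that the decimals are stored and only hidden by the default print width, a small sketch using the first value of colC from the question:

```r
# Value taken from the question's colC column
x <- as.numeric("4023107.87")

print(x)                # default of 7 significant digits hides the decimals
print(x, digits = 10)   # one-off override for this call only

# Formatting to a string leaves the stored number untouched
shown_7  <- format(x, digits = 7)
shown_10 <- format(x, digits = 10)
fixed_2  <- sprintf("%.2f", x)

# The fractional part is still present in the stored value
frac <- x - 4023107
print(frac)
```

For reporting, format() or sprintf() on the individual column is usually less intrusive than changing options(digits = ...) globally, since the global option affects every number printed in the session.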
