Comparing character lists in R

I have two lists of characters that I read in from Excel files.
One is a very long list of all bird species that have been documented in a region (allBirds), and the other is a much shorter list of species that were recently seen in a sample location (sampleBirds). I want to write a section of code that compares the two and tells me which sampleBirds show up in the allBirds list. Both are lists of characters.
I have tried:
# read in the xlsx files
Full_table <- read_excel("Full_table.xlsx")
Pathogen_table <- read_excel("pathogens.xlsx")
# read species column into a new data frame
species <-c(as.data.frame(Full_table[,7], drop=FALSE))
pathogens <- c(as.data.frame(Pathogen_table[,3], drop=FALSE))
intersect(pathogens, species)
intersect(species, pathogens)
but intersect is returning lists of length 0, which I know cannot be right. Any suggestions?

Maybe you can try the match() function or ==.
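For instance, a minimal sketch with made-up species vectors (the names just stand in for the question's data); %in% is built on match():

# hypothetical vectors standing in for the question's data
allBirds    <- c("Robin", "Blue Jay", "Cardinal", "Osprey")
sampleBirds <- c("Cardinal", "Robin", "Heron")

sampleBirds %in% allBirds                 # TRUE TRUE FALSE
sampleBirds[sampleBirds %in% allBirds]    # "Cardinal" "Robin"
match(sampleBirds, allBirds)              # 3 1 NA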

You need to run the intersect on the individual columns that are stored in the list:
> a <- c(data.frame(c1=as.factor(c('a', 'q'))))
> b <- c(data.frame(c1=as.factor(c('w', 'a'))))
> intersect(a,b)
list()
> intersect(a$c1,b$c1)
[1] "a"
This will probably do in your case
intersect(Full_table[,7], Pathogen_table[,3])
Or if you insist on creating the data.frames:
intersect(pathogens[[1]], species[[1]])
where [[1]] selects the first (and only) element of each list, i.e. the species column. Note that by wrapping as.data.frame(...) in c(...) you are converting the data.frame to a plain list. I'd go with just as.data.frame(...).
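Putting this together with the variable names from the question, a sketch (assuming the 7th column of Full_table and the 3rd column of Pathogen_table hold the species names) could be:

library(readxl)

Full_table     <- read_excel("Full_table.xlsx")
Pathogen_table <- read_excel("pathogens.xlsx")

# [[ extracts each column as a plain character vector rather than a one-column data frame
species   <- Full_table[[7]]
pathogens <- Pathogen_table[[3]]

# names that appear in both lists
intersect(pathogens, species)

# or flag each pathogen entry individually
pathogens[pathogens %in% species]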

Related

Dynamically change part of variable name in R

I am trying to automate some post-hoc analysis, and I will try to explain myself with a metaphor that I believe illustrates what I am trying to do.
Suppose I have strings in two lists: the first is a list of names and the other is a list of adjectives:
list1 <- c("apt", "farm", "basement", "lodge")
list2 <- c("tiny", "noisy")
Let's also suppose I have a data frame with a bunch of data whose columns are named something like this, as they are the results of some previous linear analysis.
> head(df)
qt[apt_tiny,Intercept] qt[apt_noisy,Intercept] qt[farm_tiny,Intercept]
1 4.196321 -0.4477012 -1.0822793
2 3.231220 -0.4237787 -1.1433449
3 2.304687 -0.3149331 -0.9245896
4 2.768691 -0.1537728 -0.9925387
5 3.771648 -0.1109647 -0.9298861
6 3.370368 -0.2579591 -1.0849262
and so on...
Now, what I am trying to do is automate some operations where the strings from the previous lists change dynamically as a for loop runs. I have made a list with all the distinct combinations and called it distinct. Now I am trying to do something like this:
for (i in 1:nrow(distinct)){
  var1[[i]] <- list1[[i]]
  var2[[i]] <- list2[[i]]
  # this being the insertable name part for the rest of the variables and parts of variable,
  # i'll put it inside %var[[i]]% for the sake of the explanation.
  %var1[[i]]%_%var2[[i]]%_INT <- df$`qt[%var1[[i]]%_%var2[[i]]%,Intercept]` + df$`qt[%var1[[i]]%,Intercept]`
}
The difficult thing for me here is that %var1[[i]]% is at the same time part of a variable name and part of a column name inside a data frame.
Any help would be much appreciated.
You cannot use $ to extract a column whose name is stored in a character variable, so df$`qt[%var1[[i]]%_%var2[[i]]%,Intercept]` will not work.
Create the name of the column using sprintf and use [[ to extract it. For example, to construct "qt[apt_tiny,Intercept]" as the column name you can do:
i <- 1
sprintf('qt[%s_%s,Intercept]', list1[i], list2[i])
#[1] "qt[apt_tiny,Intercept]"
Now use [[ to subset that column from df
df[[sprintf('qt[%s_%s,Intercept]', list1[i], list2[i])]]
You can do the same for other columns.
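A rough sketch of how this could be wrapped into the loop from the question; it assumes distinct has the name in its first column and the adjective in its second, and stores each result in a named list rather than creating one standalone variable per combination (which is usually easier to work with):

# Sketch only: `distinct` and `df` are assumed to be as in the question,
# with distinct[[1]] holding the names and distinct[[2]] the adjectives.
results <- list()
for (i in 1:nrow(distinct)) {
  nm  <- distinct[[1]][i]
  adj <- distinct[[2]][i]
  int_col  <- sprintf("qt[%s_%s,Intercept]", nm, adj)
  base_col <- sprintf("qt[%s,Intercept]", nm)
  results[[sprintf("%s_%s_INT", nm, adj)]] <- df[[int_col]] + df[[base_col]]
}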

How to find common variables in a list of datasets & reshape them in R?

setwd("C:\\Users\\DATA")
temp = list.files(pattern="*.dta")
for (i in 1:length(temp)) assign(temp[i], read.dta13(temp[i], nonint.factors = TRUE))
grep(pattern="_m", temp, value=TRUE)
Here I create a list of my datasets and read them into R. I then attempt to use grep to find all variable names with the pattern _m; obviously this doesn't work, because it simply returns all filenames containing _m. So essentially what I want is for my code to loop through the list of databases, find variables ending with _m, and return a list of the databases that contain these variables.
I'm quite unsure how to do this; I'm quite new to coding and R.
Apart from needing to know in which databases these variables are, I also need to be able to make changes (reshape them) to these variables.
First, assign will not work as you think, because it expects a single string (or character vector, as they are called in R); if given a vector of names, it will only use the first element as the variable name (see here for more info).
What you can do depends on the structure of your data. read.dta13 will load each file as a data.frame.
If you are looking for column names, you can do something like this:
myList <- character()
for (i in 1:length(temp)) {
  # save the content of your file in a data frame
  df <- read.dta13(temp[i], nonint.factors = TRUE)
  # identify the names of the columns matching your pattern
  varMatch <- grep(pattern = "_m", colnames(df), value = TRUE)
  # check if at least one of the columns matches the pattern
  if (length(varMatch) > 0) {
    myList <- c(myList, temp[i])  # save the file name if there is a match
  }
}
If you are looking at the content of a column, have a look at the dplyr package, which is very useful for data frame manipulation.
A good introduction to dplyr is available in the package vignette here.
Note that in R, appending to a vector can become very slow (see this SO question for more details).
Here is one way to figure out which files have variables with names ending in "_m":
# setup
setwd("C:\\Users\\DATA")
temp = list.files(pattern="*.dta")
# logical vector to be filled in
inFileVec <- logical(length(temp))
# loop through each file
for (i in 1:length(temp)) {
  # read file
  fileTemp <- read.dta13(temp[i], nonint.factors = TRUE)
  # fill in vector with TRUE if any variable ends in "_m"
  inFileVec[i] <- any(grepl("_m$", names(fileTemp)))
}
In the final line, names returns the variable names, grepl returns a logical vector for whether each variable name matches the pattern, and any returns a logical vector of length 1 indicating whether or not at least one TRUE was returned from grepl.
# print out these file names
temp[inFileVec]
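If you prefer to avoid the explicit loop, roughly the same check can be written with vapply (just a sketch; read.dta13 comes from the readstata13 package, as used in the question):

# sketch: same check without an explicit for loop
inFileVec <- vapply(temp, function(f) {
  dat <- read.dta13(f, nonint.factors = TRUE)
  any(grepl("_m$", names(dat)))
}, logical(1))

# files containing at least one variable ending in "_m"
temp[inFileVec]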

Turn Multiple Uneven Nested Lists Into A DataFrame in R

I am trying to get to grips with R, and as an experiment I thought I would play around with some cricket data. In its rawest form it is a YAML file, which I turned into an R object using the yaml package.
However, I now have a number of nested lists of uneven length that I want to turn into a data frame. I have tried a few methods, such as writing loops to parse the data and using some of the functions in the tidyr package, but I can't seem to get it to work nicely.
I wondered if people knew of the best way to tackle this? Replicating the data structure here would be difficult, because the complexity comes from the multiple nested lists and the unevenness of their lengths (which would make for a very long code block). However, you can find the raw yaml data here: http://cricsheet.org/downloads/ (I was using the ODI internationals).
Thanks in advance!
Update
I have tried this:
1) Using tidyr::separate
d <- unnest(balls)
Name <- c("Batsman","Bowler","NonStriker","RunsBatsman","RunsExtras","RunsTotal","WicketFielder","WicketKind","PlayerOut")
a <- separate(d, x, Name, sep = ",",extra = "drop")
This basically uses the tidyr package to return a single-column data frame that I then try to separate. However, the problem is that there are sometimes extra variables that appear in some rows and not others, throwing off the separation.
2) Creating vectors
ballsVector <- unlist(balls[[2]],use.names = FALSE)
names_vector <- c("Batsman","Bowler","NonStriker","RunsBatsman","RunsExtras","RunsTotal")
names(ballsVector) <- c(names_vector)
ballsMatrix <- matrix(ballsVector, nrow = 1, byrow = TRUE)
colnames(ballsMatrix) <- names_vector
The problem here is that the resulting vectors are uneven in length and therefore can't be combined into a data frame. It also suffers from the issue that there are sporadic variables in the middle of the dataset (as above).
Caveat: not a complete answer; an attempt to arrange the innings data.
plyr::rbind.fill may offer a solution to binding rows with a different number of columns.
I don't use tidyr, but below is some rough code to get the innings data into a data.frame. You could then loop this through all the yaml files in the directory.
# Download and unzip data
download.file("http://cricsheet.org/downloads/odis.zip", temp<- tempfile())
tmp <- unzip(temp)
# Create lists - use first game
library(yaml)
raw_dat <- yaml.load_file(tmp[[2]])
#names(raw_dat)
# Function to process list into dataframe
p_fun <- function(X) {
  team = X[[1]][["team"]]
  # function to process each list subelement that represents each delivery
  fn <- function(...) {
    tmp = unlist(...)
    tmp = data.frame(ball = gsub("[^0-9]", "", names(tmp))[1], t(tmp))
    colnames(tmp) = gsub("[0-9]", "", colnames(tmp))
    tmp
  }
  # loop over all deliveries
  lst = lapply(X[[1]][["deliveries"]], fn)
  cbind(team, plyr::rbind.fill(lst))
}
# Loop over each innings
dat <- plyr::rbind.fill(lapply(raw_dat$innings, p_fun))
Some explanation
The list structure and subsetting it. To get an idea of the structure of the list use
str(raw_dat) # but this gives a really long list of data
You can truncate this, to make it a bit more useful
str(raw_dat, 3)
length(raw_dat)
So there are three main list elements - meta, info, and innings. You can also see this with
names(raw_dat)
To access the meta data, you can use
raw_dat$meta
#or using `[[1]]` to access the first element of the list (see ?'[[')
raw_dat[[1]]
#and get sub-elements by either
raw_dat$meta$data_version
raw_dat[[1]][[1]] # you can also use the names of the list elements eg [[`data_version`]]
The main data is in the innings element.
str(raw_dat$innings, 3)
Look at the names in the list element
lapply(raw_dat$innings, names)
lapply(raw_dat$innings[[1]], names)
There are two list elements, each with sub-elements. You can access these as
raw_dat$innings[[1]][[1]][["team"]] # raw_dat$innings[[1]][["1st innings"]][["team"]]
raw_dat$innings[[2]][[1]][["team"]] # raw_dat$innings[[2]][["2nd innings"]][["team"]]
The above function parsed the deliveries data in raw_dat$innings. To see what it does, work through it from the inside.
Use one record to see how it works
(note that the lapply with p_fun loops over raw_dat$innings[[1]] and raw_dat$innings[[2]]; this is the outer loop, and the lapply with fn loops through the deliveries within an innings; the inner loop)
X <- raw_dat$innings[[1]]
tmp <- X[[1]][["deliveries"]][[1]]
tmp
#create a named vector
tmp <- unlist(tmp)
tmp
# 0.1.batsman 0.1.bowler 0.1.non_striker 0.1.runs.batsman 0.1.runs.extras 0.1.runs.total
# "IR Bell" "DW Steyn" "MJ Prior" "0" "0" "0"
To use rbind.fill, the elements to bind together need to be data.frames. We also want to remove the leading numbers / deliveries from the names, as otherwise we will have lots of uniquely named columns.
# this regex removes all non-numeric characters from the string
# you could then split this number into over and delivery
gsub("[^0-9]", "", names(tmp))
# this regex removes all numeric characters from the string -
# allowing consistent names across all the balls / deliveries
# (if I were better at regex I would have also removed the leading dots)
gsub("[0-9]", "", names(tmp))
So for the first delivery in the first innings we have
tmp = data.frame(ball=gsub("[^0-9]", "", names(tmp))[1], t(tmp))
colnames(tmp) = gsub("[0-9]", "", colnames(tmp))
tmp
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 01 IR Bell DW Steyn MJ Prior 0 0 0
To see how the lapply works, use the first three deliveries (you will need to run the function fn in your workspace)
lst = lapply(X[[1]][["deliveries"]][1:3], fn )
lst
# [[1]]
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 01 IR Bell DW Steyn MJ Prior 0 0 0
#
# [[2]]
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 02 IR Bell DW Steyn MJ Prior 0 0 0
#
# [[3]]
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 03 IR Bell DW Steyn MJ Prior 3 0 3
So we end up with a list element for every delivery within an innings. We then use rbind.fill to create one data.frame.
If I were going to parse every yaml file I would use a loop.
Here we use the first three records as an example, and also add the match date.
tmp <- unzip(temp)[2:4]
all_raw_dat <- vector("list", length=length(tmp))
for(i in seq_along(tmp)) {
  d = yaml.load_file(tmp[i])
  all_raw_dat[[i]] <- cbind(date = d$info$date, plyr::rbind.fill(lapply(d$innings, p_fun)))
}
Then use rbind.fill.
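A sketch of that last step, continuing from the loop above:

# bind the per-file data frames into one
final_dat <- plyr::rbind.fill(all_raw_dat)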
Q1. from comments
A small example with rbind.fill
a <- data.frame(x=1, y=2)
b <- data.frame(x=2, z=1)
rbind(a,b) # error, as the names don't match
plyr::rbind.fill(a, b)
rbind.fill doesn't go back and add/update rows with the extra columns where needed (a still doesn't have column z). Think of it as creating an empty data frame with a number of columns equal to the number of unique columns found in the list of data frames, i.e. unique(c(names(a), names(b))). The values are then filled in for each row where possible, and left missing (NA) otherwise.
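For reference, the small example above should print something like this:

plyr::rbind.fill(a, b)
#   x  y  z
# 1 1  2 NA
# 2 2 NA  1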

R select names from a list using logical vector

Suppose I have a list and I want to find which of its elements have components in common with another vector.
This is my list, neighbors:
neighbors[[1]]
[1] "CNBP" "IGF2BP1" "RPL3|OK/SW-cl.32"
[4] "HNRNPC" "PURA|hCG_45299" "RPS3A"
"Cnbp" "Mis12|DN-393H17.5"
neighbors[[2]]
[1] "NIN" "PRKACA" "AURKA|RP5-1167H4.6"
[4] "GSK3B" "AMOT" "UBC"
and my vector of interest
mtop
[1] "TUBA1A" "DNAJB1" "MME"
[4] "PRKCB" "PARK2|KB-152G3.1" "UBC"
My idea, for example, is to return neighbors[2], which has "UBC" in common with mtop.
Any ideas?
First off, your data. Your output appears somewhat strange. If this is not what you have, consider using dput to dump these variables in a reproducible way.
mtop <- c("TUBA1A", "DNAJB1", "MME",
"PRKCB", "PARK2|KB-152G3.1", "UBC")
neighbors <- list(c("CNBP", "IGF2BP1", "RPL3|OK/SW-cl.32",
"HNRNPC", "PURA|hCG_45299", "RPS3A",
"Cnbp", "Mis12|DN-393H17.5"),
c("NIN", "PRKACA", "AURKA|RP5-1167H4.6",
"GSK3B", "AMOT", "UBC"))
To select those elements of list neighbors which have at least one vector element in common with mtop, you can use this command:
matching <- sapply(neighbors, function(l) length(intersect(mtop, l)) > 0)
print(neighbors[matching])
This will print neighbors[2], as it has "UBC" in common with mtop. It does this via the logical vector matching, which seems to be what your question asked.
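With the example data above, that should print roughly:

neighbors[matching]
# [[1]]
# [1] "NIN"                "PRKACA"             "AURKA|RP5-1167H4.6"
# [4] "GSK3B"              "AMOT"               "UBC"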
If you want to take position into account, i.e. only select neighbors[2] because "UBC" is in position 6 in both vectors, then you should use this command:
matching <- sapply(neighbors, function(l) any(l == mtop))
However, this will create a warning, as neighbors[[1]] is longer than mtop.
If you want the names common to both your data structures, you can use this code:
intersect(unlist(neighbors), mtop)
If you need something else, you have to be more specific in your question, i.e. give an explicit example of what the output should look like, and cover all the possible input configurations that might lead to structurally different output.
How about:
l <- lapply(neighbors, function(x) x[x %in% mtop])
This returns a list where each element contains only the values that also appear in the vector mtop.
Now select only those elements which have non-zero length:
names(l)[sapply(l,length)>0]
You can combine these into one line:
names(neighbors)[sapply(neighbors, function(x) Reduce("|", mtop %in% x))]
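Note that the example list defined above has no names, so names(neighbors) (and names(l)) would be NULL; in that case a sketch returning the matching positions instead could be:

# the example list has no names, so return the matching positions instead
which(sapply(l, length) > 0)
# [1] 2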

Basic R question on manipulating data frames

I have a data frame with several columns; the rows have names.
I want to calculate some value for each row (col1/col2) and create a new data frame with the original row names. If I just do something like data$col1/data$col2, I get a vector with the results but lose the row names.
I know it's very basic, but I'm quite new to R.
It would help to read ?"[.data.frame" to understand what's going on. Specifically:
Note that there is no data.frame method for $, so x$name uses the default method which treats x as a list.
You will see that the object's names are lost if you convert a data.frame to a list (using Joris' example data):
> as.list(Data)
$col1
[1] -0.2179939 -2.6050843 1.6980104 -0.9712305 1.6953474 0.4422874
[7] -0.5012775 0.2073210 1.0453705 -0.2883248
$col2
[1] -1.3623349 0.4535634 0.3502413 -0.1521901 -0.1032828 -0.9296857
[7] 1.4608866 1.1377755 0.2424622 -0.7814709
My suggestion would be to avoid using $ if you want to keep row names. Use this instead:
> Data["col1"]/Data["col2"]
col1
a 0.1600149
b -5.7435947
c 4.8481157
d 6.3816918
e -16.4146120
f -0.4757387
g -0.3431324
h 0.1822161
i 4.3114785
j 0.3689514
Use the function names() to add the names:
Data <- data.frame(col1=rnorm(10),col2=rnorm(10),row.names=letters[1:10])
x <- Data$col1/Data$col2
names(x) <- row.names(Data)
This solution gives a vector with the names. To get a data frame (solution from Marek):
NewFrame <- data.frame(x=Data$col1/Data$col2,row.names=row.names(Data))
A very simple and neat way is to use row.names(your_data_frame) to store the row names as an ordinary column and manipulate further from there.
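A minimal sketch of that idea, reusing the example Data from above (the new column names are arbitrary):

# keep the row names as an ordinary column, then manipulate as usual
Data$name  <- row.names(Data)
Data$ratio <- Data$col1 / Data$col2
head(Data)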
