R:How to intersect list of dataframes and specifc column

R:How to intersect list of dataframes and specifc column - r

I am trying to find all matching values in a specific column, in a list of data.frames. However, I keep getting a returned value of character(0).
I have tried the following:
Simple subset (very time consuming) -> e.g. dat[[i]][[i]]
lapply w/ Reduce and intersect (as seen here
LocA<-data.frame(obs.date=c("2018-01-10","2018-01-14","2018-01-20),
obs.count=c(2,0,1))
LocB<-data.frame(obs.date=c("2018-01-09","2018-01-14","2018-01-20),
obs.count=c(0,3,5))
LocC<-data.frame(obs.date=c("2018-01-12","2018-01-14","2018-01-19"),
obs.count=c(2,0,1))
LocD<-data.frame(obs.date=c("2018-01-11","2018-01-16","2018-01-21"),
obs.count=c(2,0,1))
dfList<-list(LocA,LocB,LocC,LocD)
##List of all dates
lapply(dfList,'[[',1)
[1]"2018-01-10" "2018-01-14" "2018-01-20" "2018-01-09"...
Attempts (failure)
>Reduce(intersect,lapply(dfList,'[[',1))
character (0)
I expect the output of this function to be an output identifying the data.frames that share a common date.
*Extra smiles if someone know how to identify shared dates and mutate in to a single data frame where..Col1 = dataframe name, Col2=obs.date,Col3 = obs.count

You can first merge all the data frames so you only have one:
a <- Reduce(function(x, y) merge(x, y, all = TRUE), dfList)
Or you can merge them like this:
a <-rbind(LocA,LocB,LocC,LocD)
Afterwards, you can extract all the duplicates:
b <- a[duplicated(a$obs.date), ]
Or if you want to keep all the unique ones and keep the duplicates:
c <- a[!duplicated(a$obs.date), ]

If by "intersect" you mean doing an "inner join" or "merging" with a specific column as key, then -- you want to use dplyr::inner_join or merge.
First, between two data.frames:
library(dplyr)
inner_join(LocA, LocB, by='obs.date')
# 2 rows
inner_join(LocC, LocD, by='obs.date')
# zero rows
So, not infinite merging.
Stack, then count
We'll combine the data first, then count the occurences. Notice the use of the .id-argument to track where the row originated.
library(dplyr)
bind_rows(dfList, .id = 'id') %>%
add_count(obs.date) %>%
filter(n > 1)
# A tibble: 5 x 4
id obs.date obs.count n
<chr> <chr> <dbl> <int>
1 1 2018-01-14 0 3
2 1 2018-01-20 1 2
3 2 2018-01-14 3 3
4 2 2018-01-20 5 2
5 3 2018-01-14 0 3

Related

Find duplicates in comma-separated string data using R

I am currently trying to run multiple data using loop, and ended up with the results below. Each line corresponds to the output from one data that has been filtered.
I am using this code to get the results below.
output <- print(paste(data.final$Peptide, collapse = ','))
It was previously given the form of table with one of the column name "Peptide", so I am pasting the peptide into a string separated comma, as shown here :
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY"
I would like to find the number of duplicates from each comma-separated strings (eg. LPPAYTNSF) between the lines.
Is there anyways to do this?

If I understand you well, there is no reason to first collapse them to a string. Why not just get the counts from your data.frame? Any n > 1 has duplicates.
data.final %>% count(Peptide)
solution based on your string
table(unlist(strsplit(v, ",")))
results
ASFSTFKCY CVADYSVLY KIYSKHTPI LPFNDGVYF LPPAYTNSF LPSAYTNSF QSYGFQPTY RLFRKSNLK SANNCTFEY SAYTNSFTR TSNQVAVLY WMESEFRVY
8 8 8 8 6 2 6 8 8 2 8 8
WTAGAAAYY YLQPRTFLL YNSASFSTF YSSANNCTF
8 8 8 8
data
v <- "LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY"

Not quite sure what your data are like so I've scribbled together some toy data to show how a regex and a non-regex tidyverse solution would work on your task of "find[ing] the number of duplicates from each comma-separated strings":
Data:
x <- c("a,c,x,a,f,s,w,s,b,n,x,q",
"A,B,B,X,B,Q")
A regex tidyverse solution:
library(tidyverse)
data.frame(x) %>%
# create new column with duplicated values:
mutate(dups = str_extract_all(x, "([A-Za-z])+(?=.*\\1)"))
x dups
1 a,c,x,a,f,s,w,s,b,n,x,q a, x, s
2 A,B,B,X,B,Q B, B
NB: the + in the regex pattern makes sure you can use this solution also for longer-than-one-character alphabetic comma-separated strings
Alternatively, a non-regex tidyverse solution:
library(tidyverse)
data.frame(x) %>%
# make column with unique ID for each string:
mutate(stringID = row_number()) %>%
# separate values into rows:
separate_rows(x) %>%
# for each combination of `stringID` and `x`...
group_by(stringID, x) %>%
# ...count the number of tokens:
summarise(N = n()) %>%
# show only the duplicated values:
filter(N > 1)
# A tibble: 4 × 3
# Groups: stringID [2]
stringID x N
<int> <chr> <int>
1 1 a 2
2 1 s 2
3 1 x 2
4 2 B 3

R - Creating DFs (tibbles) in a loop. How to rename them and columns inside, to include date? (I do it with eval(..), but is there a better solution?)

I have a loop, that creates a tibble at the end of each iteration, tbl. Loop uses different date each time, date.
Assume:
tbl <- tibble(colA=1:5,colB=5:10)
date <- as.Date("2017-02-28")
> tbl
# A tibble: 5 x 2
colA colB
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
(contents are changing every loop, but tbl, date and all columns (colA, colB) names remain the same)
The output that I want needs to start with output - outputdate1, outputdate2 etc.
With columns inside it as colAdate1, colBdate1, and colAdate2, colBdate2 and so on.
At the moment I am using this piece of code, which works, but is not easy to read:
eval(parse(text = (
paste0("output", year(date), months(date), " <- tbl %>% rename(colA", year(date), months(date), " = 'colA', colB", year(date), months(date), " = 'colB')")
)))
It produces this code for eval(parse(...) to evaluate:
"output2017February <- tbl %>% rename(colA2017February = 'colA', colB2017February = 'colB')"
Which gives me the output that I want:
> output2017February
# A tibble: 5 x 2
colA2017February colB2017February
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
Is there a better way of doing this? (Preferably with dplyr)
Thanks!

This avoids eval and is easier to read:
ym <- "2017February"
assign(paste0("output", ym), setNames(tbl, paste0(names(tbl), ym)))
Partial rename
If you only wanted to replace the names in the character vector old with the corresponding names in the character vector new then use the following:
assign(paste0("output", ym),
setNames(tbl, replace(names(tbl), match(old, names(tbl)), new)))
Variation
You might consider putting your data frames in a list instead of having a bunch of loose objects in your workspace:
L <- list()
L[[paste0("output", ym)]] <- setNames(tbl, paste0(names(tbl), ym))
.GlobalEnv could also be used in place of L (omitting the L <- list() line) if you want this style but still to put the objects separately in the global environment.
dplyr
Here it is using dplyr and rlang but it does involve increased complexity:
library(dplyr)
library(rlang)
.GlobalEnv[[paste0("output", ym)]] <- tbl %>%
rename(!!!setNames(names(tbl), paste0(names(tbl), ym)))

How do I Remove Data From Non-Scaled Dataframe against a Scaled One

I'm using R right now where I'm scaling the original data, removing all outliers with a Z-Score of 3 or more, and then filtering out the unscaled data so that it contains only non-outliers. I want to be left with a data frame that contains non-scaled numbers after removing outliers. These were my steps:
Steps
1. Create two data frames (x, y) of the same data
2. Scale x and leave y unscaled.
3. Filter out all rows that have greater than 3 Z-Score in x
4. Currently, for example, x may have 95,000 rows while y still has 100,000
5. Truncate y based on a unique column called Row ID, which I made sure was unscaled in x. This unique column will help me match up the remaining rows in x and the rows in y.
6. y should now have the same number of rows as x, but with the data unscaled. x has the scaled data.
At the moment I can't get the data to be unscaled. I tried using the unscale method or data frame comparison tools but R complains I cannot work on data frames of two different sizes. Is there a workaround to this?
Tries
I've tried dataFrame <- dataFrame[dataFrame$Row %in% remainingRows] but that left nothing in my data frame.
I would also provide data, but it has sensitive information, so any data frame will do so long as it has a unique row ID that won't change during scaling.

If I understood correctly what you want to do, I'm suggesting a different approach. You could use two data.frames for that, but if you use the dplyrpackage, you can do everything within a single line of code ... and presumably faster as well.
First I'm generating a data.frame with 100k rows, which has an ID column (just 1:100000 sequence) and a value (random numbers).
Here's the code:
library(dplyr)
#generate data
x <- data.frame(ID=1:100000,value=runif(100000,max=100)*runif(10000,max=100))
#take a look
> head(x)
ID value
1 1 853.67941
2 2 632.17472
3 3 3089.60716
4 4 8448.89408
5 5 5307.75684
6 6 19.07485
To filter out the outliers, I'm using a dplyr pipe, where I chain multiple operations together with the pipe (%>%) operator. First calculate the zscore, then filter the observations with a zscore bigger than three, and finally drop the zscore column again to go back to your original format (of course you can keep it as well):
xclean <- x %>% mutate(zscore=(value-mean(value)) / sd(value)) %>%
filter(zscore < 3) %>% select(-matches('zscore'))
If you look at the rows, you'll see that the filtering worked
> cat('Rows of X:',nrow(x),'- Rows of xclean:',nrow(xclean))
Rows of X: 100000 - Rows of xclean: 99575
while the data looks like the original data.frame:
> head(xclean)
ID value
1 1 853.67941
2 2 632.17472
3 3 3089.60716
4 4 8448.89408
5 5 5307.75684
6 6 19.07485
Finally, you can see that observations have been filtered out by comparing the IDs of the two data.frames:
> head(x$ID[!is.element(x$ID,xclean$ID)],50)
[1] 68 90 327 467 750 957 1090 1584 1978 2106 2306 3415 3511 3801 3855 4051
[17] 4148 4244 4266 4511 4875 5262 5633 5944 5975 6116 6263 6631 6734 6773 7320 7577
[33] 7619 7731 7735 7889 8073 8141 8207 8966 9200 9369 9994 10123 10538 11046 11090 11183
[49] 11348 11371
EDIT:
Of course, the 2 data frames version is also possible:
y <- x
# calculate zscore
x$value <- (x$value - mean(x$value))/sd(x$value)
#subset y
y <- y[x$value<3,]
# initially 100k rows
> nrow(y)
[1] 99623
Edit2:
Accounting for multiple value columns:
#generate data
set.seed(21)
x <- data.frame(ID=1:100000,value1=runif(100000,max=100)*runif(10000,max=100),
value2=runif(100000,max=100)*runif(10000,max=100),
value3=runif(100000,max=100)*runif(10000,max=100))
> head(x)
ID value1 value2 value3
1 1 2103.9228 5861.33650 713.885222
2 2 341.8342 3940.68674 578.072141
3 3 5346.2175 458.07089 1.577347
4 4 400.1950 5881.05129 3090.618355
5 5 7346.3321 4890.56501 8989.248186
6 6 5305.5105 38.93093 517.509465
The dplyr solution:
# make sure you got a recent version of dplyr
> packageVersion('dplyr')
[1] ‘0.7.2’
# define zscore function:
zscore <- function(x){(x-mean(x))/sd(x)}
# select variables (could also be manually with c())
vars_to_process <- grep('value',colnames(x),value=T)
# calculate zscores and filter
xclean <- x %>% mutate_at(.vars=vars_to_process, .funs=funs(ZS = zscore(.))) %>%
filter_at(vars(matches('ZS')),all_vars(.<3)) %>%
select(-matches('ZS'))
> nrow(xclean)
[1] 98832
Now the solution without dplyr (instead of using 2 dataframes, I'll generate a boolean index based on x:
# select variables
vars_to_process <- grep('value',colnames(x),value=T)
# create index ZS < 3
ix <- apply(x[vars_to_process],2,function(x) (x-mean(x))/sd(x) < 3)
#filter rows
xclean <- x[rowSums(ix) == length(vars_to_process),]
> nrow(xclean)
[1] 98832

R: Subset data frame based on multiple values for multiple variables

I need to pull records from a first data set (called df1 here) based on a combination of specific dates, ID#s, event start time, and event end time that match with a second data set (df2). Everything works fine when there is just 1 date, ID, and event start and end time, but some of the matching records between the data sets contain multiple IDs, dates, or times, and I can't get the records from df1 to subset properly in those cases. I ultimately want to put this in a FOR loop or independent function since I have a rather large dataset. Here's what I've got so far:
I started just by matching the dates between the two data sets as follows:
match_dates <- as.character(intersect(df1$Date, df2$Date))
Then I selected the records in df2 based on the first matching date, also keeping the other columns so I have the other ID and time information I need:
records <- df2[which(df2$Date == match_dates[1]), ]
The date, ID, start, and end time from records are then:
[1] "01-04-2009" "599091" "12:00" "17:21"
Finally I subset df1 for before and after the event based on the date, ID, and times in records and combined them into a new data frame called final to get at the data contained in df1 that I ultimately need.
before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
after <- subset(df1, NUM==records$ID & Date==records$Date & Time>records$End)
final <- rbind(before, after)
Here's the real problem - some of the matching dates have more than 1 corresponding row in df2, and return multiple IDs or times. Here is what an example of multiple records looks like:
records <- df2[which(df2$Date == match_dates[25]), ]
> records$ID
[1] 507646 680845 680845
> records$Date
[1] "04-02-2009" "04-02-2009" "04-02-2009"
> records$Start
[1] "09:43" "05:37" "11:59"
> records$End
[1] "05:19" "11:29" "16:47"
When I try to subset df1 based on this I get an error:
before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
Warning messages:
1: In NUM == records$ID :
longer object length is not a multiple of shorter object length
2: In Date == records$Date :
longer object length is not a multiple of shorter object length
3: In Time < records$Start :
longer object length is not a multiple of shorter object length
Trying to do it manually for each ID-date-time combination would be way to tedious. I have 9 years worth of data, all with multiple matching dates for a given year between the data sets, so ideally I would like to set this up as a FOR loop, or a function with a FOR loop in it, but I can't get past this. Thanks in advance for any tips!

If you're asking what I think you are the filter() function from the dplyr package combined with the match function does what you're looking for.
> df1 <- data.frame(A = c(rep(1,4),rep(2,4),rep(3,4)), B = c(rep(1:4,3)))
> df1
A B
1 1 1
2 1 2
3 1 3
4 1 4
5 2 1
6 2 2
7 2 3
8 2 4
9 3 1
10 3 2
11 3 3
12 3 4
> df2 <- data.frame(A = c(1,2), B = c(3,4))
> df2
A B
1 1 3
2 2 4
> filter(df1, A %in% df2$A, B %in% df2$B)
A B
1 1 3
2 1 4
3 2 3
4 2 4

R sum of rows for different group of columns that start with similar string

I'm quite new to R and this is the first time I dare to ask a question here.
I'm working with a dataset with likert scales and I want to row sum over different group of columns which share the first strings in their name.
Below I constructed a data frame of only 2 rows to illustrate the approach I followed, though I would like to receive feedback on how I can write a more efficient way of doing it.
df <- as.data.frame(rbind(rep(sample(1:5),4),rep(sample(1:5),4)))
var.names <- c("emp_1","emp_2","emp_3","emp_4","sat_1","sat_2"
,"sat_3","res_1","res_2","res_3","res_4","com_1",
"com_2","com_3","com_4","com_5","cap_1","cap_2",
"cap_3","cap_4")
names(df) <- var.names
So, what I did, was to use the grep function in order to be able to sum the rows of the specified variables that started with certain strings and store them in a new variable. But I have to write a new line of code for each variable.
df$emp_t <- rowSums(df[, grep("\\bemp.", names(df))])
df$sat_t <- rowSums(df[, grep("\\bsat.", names(df))])
df$res_t <- rowSums(df[, grep("\\bres.", names(df))])
df$com_t <- rowSums(df[, grep("\\bcom.", names(df))])
df$cap_t <- rowSums(df[, grep("\\bcap.", names(df))])
But there is a lot more variables in the dataset and I would like to know if there is a way to do this with only one line of code. For example, some way to group the variables that start with the same strings together and then apply the row function.
Thanks in advance!

One possible solution is to transpose df and calculate sums for the correct columns using base R rowsum function (using set.seed(123))
cbind(df, t(rowsum(t(df), sub("_.*", "_t", names(df)))))
# emp_1 emp_2 emp_3 emp_4 sat_1 sat_2 sat_3 res_1 res_2 res_3 res_4 com_1 com_2 com_3 com_4 com_5 cap_1 cap_2 cap_3 cap_4 cap_t
# 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 13
# 2 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 14
# com_t emp_t res_t sat_t
# 1 15 14 11 7
# 2 15 10 12 9

Agree with MrFlick that you may want to put your data in long format (see reshape2, tidyr), but to answer your question:
cbind(
df,
sapply(split.default(df, sub("_.*$", "_t", names(df))), rowSums)
)
Will do the trick

You'll be better off in the long run if you put your data into tidy format. The problem is that the data is in a wide rather than a long format. And the variable names, e.g., emp_1, are actually two separate pieces of data: the class of the person, and the person's ID number (or something like that). Here is a solution to your problem with dplyr and tidyr.
library(dplyr)
library(tidyr)
df %>%
gather(key, value) %>%
extract(key, c("class", "id"), "([[:alnum:]]+)_([[:alnum:]]+)") %>%
group_by(class) %>%
summarize(class_sum = sum(value))
First we convert the data frame from wide to long format with gather(). Then we split the values emp_1 into separate columns class and id with extract(). Finally we group by the class and sum the values in each class. Result:
Source: local data frame [5 x 2]
class class_sum
1 cap 26
2 com 30
3 emp 23
4 res 22
5 sat 19

Another potential solution is to use dplyr R rowwise function. https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-rowwise/
df %>%
rowwise() %>%
mutate(emp_sum = sum(c_across(starts_with("emp"))),
sat_sum = sum(c_across(starts_with("sat"))),
res_sum = sum(c_across(starts_with("res"))),
com_sum = sum(c_across(starts_with("com"))),
cap_sum = sum(c_across(starts_with("cap"))))

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

R:How to intersect list of dataframes and specifc column - r

Related

Find duplicates in comma-separated string data using R

R - Creating DFs (tibbles) in a loop. How to rename them and columns inside, to include date? (I do it with eval(..), but is there a better solution?)

How do I Remove Data From Non-Scaled Dataframe against a Scaled One

R: Subset data frame based on multiple values for multiple variables

R sum of rows for different group of columns that start with similar string

Categories

Resources