Subsetting rows in R

I have a huge data set in the following format:
ID Interaction Interaction_number
1 abc 1
1 xyz 2
1 pqr 3
1 ced 0
2 ab 0
2 efg 1
3 asdf 2
3 fgh 3
3 abc 0
4 sql 1
4 ghj 2
5 poi 2
6 pqr 1
Now I want to extract all rows for every ID that has an Interaction_number of 0. For example:
ID Interaction Interaction_number
1 abc 1
1 xyz 2
1 pqr 3
1 ced 0
2 ab 0
2 efg 1
3 asdf 2
3 fgh 3
3 abc 0
It's a huge dataset, and I need to extract these rows using R.
I tried using the sqldf function:
x<-sqldf("select * from data where data$ID in (select data$ID from data where data$Interaction_number ==0)")
But the function didn't work. I was looking to add a flagging column (1 for all IDs where there is an Interaction_number of 0) and then subset those rows, but I can't figure out exactly how to do it.
Can we create a data frame of the IDs and then use subset with that data frame to get all the rows?
Please help.
Thank you

I suggest using the data.table package. Say your data is in a data.frame df. Then:
library(data.table)
dt <- data.table(df, key = 'ID')
# flag each ID that has at least one Interaction_number == 0
tmp <- dt[, list(condition = any(Interaction_number == 0)), by = ID]
# join back on ID to keep all rows for the flagged IDs
res <- dt[tmp[condition == TRUE, list(ID)]]
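As an aside (not part of the original answer), the same grouped filter can be written in a single data.table step; the toy data below is invented for illustration:

```r
library(data.table)
# made-up sample with the question's column names
dt <- data.table(ID = c(1, 1, 2, 3),
                 Interaction = c("abc", "ced", "ab", "sql"),
                 Interaction_number = c(1, 0, 0, 1))
# return .SD (all rows) only for groups containing a zero
res <- dt[, if (any(Interaction_number == 0)) .SD, by = ID]
```

This avoids the intermediate flag table at the cost of being a little less explicit.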

Use this
sqldf("SELECT * FROM data WHERE ID IN (SELECT ID FROM data WHERE Interaction_number=0)")
You do not need the double equal in your test, and do not use data$ID and such to refer to the data columns in the SQL expression (you can use data.ID but it is unnecessary to use the dataframe name in this case).
It may be helpful to read up on SQL before using this function much. Keep in mind that what it will do is turn all your referenced dataframes into tables using the same name as the dataframe, and all of the columns into fields using the same name as the columns. Thus in this case, we are querying a table named data with fields named ID, Interaction, and Interaction_number.

We can do this with dplyr. Group the 'data' by 'ID', and filter if there are any 0 values in 'Interaction_number'.
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(any(!Interaction_number))
# ID Interaction Interaction_number
# (int) (chr) (int)
#1 1 abc 1
#2 1 xyz 2
#3 1 pqr 3
#4 1 ced 0
#5 2 ab 0
#6 2 efg 1
#7 3 asdf 2
#8 3 fgh 3
#9 3 abc 0
Or using ave from base R
df1[with(df1, ave(!Interaction_number, ID, FUN=any)),]
Or this can be done without any group by
df1[df1$ID %in% subset(df1, !Interaction_number)$ID,]
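None of the answers show the flagging-column approach the question mentions, so here is a minimal base R sketch of it, assuming the data is in a data.frame named data (the sample rows are made up):

```r
# made-up sample with the question's column names
data <- data.frame(ID = c(1, 1, 2, 3, 4),
                   Interaction = c("abc", "ced", "ab", "sql", "ghj"),
                   Interaction_number = c(1, 0, 0, 1, 2))
# flag = 1 for every row whose ID group contains a 0
data$flag <- as.integer(ave(data$Interaction_number == 0, data$ID, FUN = any))
res <- subset(data, flag == 1)
```

The flag column can then be dropped with res$flag <- NULL if it is no longer needed.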

Related

Dataframe NA conversion to specific items

I have a data frame like:
dataframe <- data.frame(ID1=c(NA,2,3,1,NA,2),ID2=c(1,2,3,1,2,2))
Now I want to replace each NA value in ID1 with the value from ID2 in the same row, like:
dataframe <- data.frame(ID1=c(1,2,3,1,2,2),ID2=c(1,2,3,1,2,2))
I think I should use the if function, but I would like to use %>% for simplicity.
Please teach me.
An ifelse solution
dataframe <- within(dataframe, ID1 <- ifelse(is.na(ID1),ID2,ID1))
such that
> dataframe
ID1 ID2
1 1 1
2 2 2
3 3 3
4 1 1
5 2 2
6 2 2
A straightforward solution is to find out NA values in ID1 and replace them with corresponding values from ID2.
inds <- is.na(dataframe$ID1)
dataframe$ID1[inds] <- dataframe$ID2[inds]
However, since you want a solution with pipes, you can use coalesce from dplyr:
library(dplyr)
dataframe %>% mutate(ID1 = coalesce(ID1, ID2))
# ID1 ID2
#1 1 1
#2 2 2
#3 3 3
#4 1 1
#5 2 2
#6 2 2
A dplyr (using %>%) solution:
sanitized <- dataframe %>%
mutate(ID1 = ifelse(is.na(ID1), ID2, ID1))

How do I count the number of values in a column and write it in a new column in R? [duplicate]

This question already has answers here:
Count number of rows per group and add result to original data frame
(11 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 3 years ago.
Suppose there is a column with values
Website
Abc
Abc
Abc
Xyz
Xyz
Pqr
Uvw
Now I want to count how many times Abc (or any other name) appears in the column and write the corresponding count in the next column.
Website Total
Abc 3
Abc 3
Abc 3
Xyz 2
Xyz 2
Pqr 1
Uvw 1
Can a function be created without manually counting each website?
1) ave Using the data shown reproducibly in the Note at the end, we can use ave to apply length to each group:
transform(DF, Count = ave(seq_along(Website), Website, FUN = length))
giving:
Website Count
1 Abc 3
2 Abc 3
3 Abc 3
4 Xyz 2
5 Xyz 2
6 Pqr 1
7 Uvw 1
2) aggregate or without duplicates:
aggregate(list(Count = 1:nrow(DF)), DF["Website"], length)
giving:
Website Count
1 Abc 3
2 Pqr 1
3 Uvw 1
4 Xyz 2
3) table Another approach is to create a table rather than a data.frame:
table(DF)
giving:
DF
Abc Pqr Uvw Xyz
3 1 1 2
4) xtabs or we can use xtabs:
xtabs(DF)
giving:
Website
Abc Pqr Uvw Xyz
3 1 1 2
Note
The input in reproducible form:
Lines <- "Website
Abc
Abc
Abc
Xyz
Xyz
Pqr
Uvw"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, strip.white = TRUE)
One option with tidyverse is add_count
library(dplyr)
df1 %>%
add_count(Website)
# A tibble: 7 x 2
# Website n
# <chr> <int>
#1 Abc 3
#2 Abc 3
#3 Abc 3
#4 Xyz 2
#5 Xyz 2
#6 Pqr 1
#7 Uvw 1

Dropping common rows in two different dataframes

I am a beginner using R. I have two data frames, shown below as df-1 and df-2. I want to combine the two data frames and drop the common rows (that is, remove the rows they share and keep only the rows with unique IDs).
Therefore, what I want to produce is df-3.
A merge is not appropriate because I don't need the common rows.
df-1
ID NUMBER FORM DATE CD AD
1 A15 200302033666 1 20031219 3 7
2 B67 200302034466 1 20031204 3 1
3 C15 200302034455 1 20031223 3 1
4 D67 200303918556 1 20030319 3 1
5 E48 200303918575 1 20030304 3 1
6 F80 200303918588 1 20030325 3 1
7 G63 200303918595 1 20030317 3 1
df-2
ID NUMBER FORM DATE CD AD
1 A15 200302033666 1 20031219 3 7
2 K99 200402034466 1 20041204 2 3
3 Z75 200502034455 2 20021222 1 6
4 D67 200303918556 1 20030319 3 1
5 E48 200303918575 1 20030304 3 1
6 F80 200303918588 1 20030325 3 1
7 G63 200303918595 1 20030317 3 1
df-3
ID NUMBER FORM DATE CD AD
1 B67 200302034466 1 20031204 3 1
2 C15 200302034455 1 20031223 3 1
3 K99 200402034466 1 20041204 2 3
4 Z75 200502034455 2 20021222 1 6
Use rbind to combine df1 and df2 and then select unique values
df3 <- unique(rbind(df1,df2))
(Note that unique only removes duplicate copies, so one copy of each common row is still kept; it will not drop the common rows entirely.)
Can you just use unique on df3 to keep only unique rows? Or, in one line,
df3 <- unique(merge(df1, df2))
(Be aware that merge with no by argument joins on all shared columns, so this returns only the rows the two data frames have in common, which is not the requested df-3.)
Also, avoid using brackets when naming variables - df(1) looks like "apply function df to 1"
If I'm interpreting your question correctly you want a dataframe with records that are present in only one of the original dataframes.
With dplyr:
library(dplyr)
df1_anti <- anti_join(df1, df2)
df2_anti <- anti_join(df2, df1)
df3 <- bind_rows(df1_anti, df2_anti)
df1_anti contains rows present in df1 but not in df2.
df2_anti contains rows present in df2 but not in df1.
df3 is the union of the two anti-joins.
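For completeness, the same symmetric difference can be sketched in base R with duplicated; the small df1/df2 below are made-up stand-ins for the question's data:

```r
# made-up miniature versions of the question's data frames
df1 <- data.frame(ID = c("A15", "B67", "D67"), CD = c(3, 3, 3))
df2 <- data.frame(ID = c("A15", "K99", "D67"), CD = c(3, 2, 3))
combined <- rbind(df1, df2)
# drop every row that appears in both inputs, keeping rows unique to one side
df3 <- combined[!(duplicated(combined) | duplicated(combined, fromLast = TRUE)), ]
```

Scanning for duplicates from both ends marks every copy of a repeated row, so only the rows present in exactly one input survive.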

How to integrate vectors from multiple data.frame objects into one without duplication?

I have position index vectors in several data.frame objects, but the column order differs between the data.frame objects. I want to integrate/merge these data.frame objects into one common data.frame with a specific order and without any duplication. Does anyone know a trick for doing this easily? Can anyone propose a possible approach?
data
v1 <- data.frame(
foo=c(1,2,3),
bar=c(1,2,2),
bleh=c(1,3,0))
v2 <- data.frame(
bar=c(1,2,3),
foo=c(1,2,0),
bleh=c(3,3,4))
v3 <- data.frame(
bleh=c(1,2,3,4),
foo=c(1,1,2,0),
bar=c(0,1,2,3))
initial output after integrating them:
initial_output <- data.frame(
foo=c(1,2,3,1,2,0,1,1,2,0),
bar=c(1,2,2,1,2,3,0,1,2,3),
bleh=c(1,3,0,3,3,4,1,2,3,4)
)
remove duplication
rmDuplicate_output <- data.frame(
foo=c(1,2,3,1,0,1,1),
bar=c(1,2,2,1,3,0,1),
bleh=c(1,3,0,3,4,1,2)
)
final desired output:
final_output <- data.frame(
foo=c(1,1,1,1,2,3,0),
bar=c(0,1,1,1,2,2,3),
bleh=c(1,1,2,3,3,0,4)
)
How can I get my final desired output easily? Is there an efficient way of doing this sort of manipulation on data.frame objects? Thanks
You could also use the mget/ls combo in order to get your data frames programmatically (without typing individual names) and then use data.table's rbindlist and unique functions for a great efficiency gain
library(data.table)
unique(rbindlist(mget(ls(pattern = "v\\d+")), use.names = TRUE))
# foo bar bleh
# 1: 1 1 1
# 2: 2 2 3
# 3: 3 2 0
# 4: 1 1 3
# 5: 0 3 4
# 6: 1 0 1
# 7: 1 1 2
As a side note, it is usually better to keep multiple data.frames in a single list so you have better control over them
We can use bind_rows from dplyr, remove the duplicates with distinct and arrange by 'bar'
library(dplyr)
bind_rows(v1, v2, v3) %>%
distinct %>%
arrange(bar)
# foo bar bleh
#1 1 0 1
#2 1 1 1
#3 1 1 3
#4 1 1 2
#5 2 2 3
#6 3 2 0
#7 0 3 4
Here is a solution:
# combine dataframes
df = rbind(v1, v2, v3)
# remove duplicated
df = df[! duplicated(df),]
# sort by 'bar' column
df[order(df$bar),]
foo bar bleh
7 1 0 1
1 1 1 1
4 1 1 3
8 1 1 2
2 2 2 3
3 3 2 0
6 0 3 4

Frequency of Characters in Strings as columns in data frame using R

I have a data frame initial of the following format
> head(initial)
Strings
1 A,A,B,C
2 A,B,C
3 A,A,A,A,A,B
4 A,A,B,C
5 A,B,C
6 A,A,A,A,A,B
and the data frame I want is final
> head(final)
Strings A B C
1 A,A,B,C 2 1 1
2 A,B,C 1 1 1
3 A,A,A,A,A,B 5 1 0
4 A,A,B,C 2 1 1
5 A,B,C 1 1 1
6 A,A,A,A,A,B 5 1 0
To generate the data frames with a large number of rows, the following code can be used:
initial<-data.frame(Strings=rep(c("A,A,B,C","A,B,C","A,A,A,A,A,B"),100))
final<-data.frame(Strings=rep(c("A,A,B,C","A,B,C","A,A,A,A,A,B"),100),A=rep(c(2,1,5),100),B=rep(c(1,1,1),100),C=rep(c(1,1,0),100))
What is the fastest way I can achieve this? Any help will be greatly appreciated
We can use base R methods for this task. We split the 'Strings' column (strsplit(...)), set the names of the output list to the row sequence, use stack to convert to a data.frame with key/value columns, get the frequency with table, convert to a data.frame, and cbind with the original dataset.
cbind(df1, as.data.frame.matrix(
  table(
    stack(
      setNames(
        strsplit(as.character(df1$Strings), ','), 1:nrow(df1))
    )[2:1])))
# Strings A B C D
#1 A,B,C,D 1 1 1 1
#2 A,B,B,D,D,D 1 2 0 3
#3 A,A,A,A,B,C,D,D 4 1 1 2
or we can use mtabulate after splitting the column.
library(qdapTools)
cbind(df1, mtabulate(strsplit(as.character(df1$Strings), ',')))
# Strings A B C D
#1 A,B,C,D 1 1 1 1
#2 A,B,B,D,D,D 1 2 0 3
#3 A,A,A,A,B,C,D,D 4 1 1 2
Update
For the new dataset 'initial', the second method works. If we need to use the first method with the correct order, convert to factor class with levels specified as the unique elements of 'ind'.
df1 <- stack(setNames(strsplit(as.character(initial$Strings), ','),
seq_len(nrow(initial))))
df1$ind <- factor(df1$ind, levels=unique(df1$ind))
cbind(initial, as.data.frame.matrix(table(df1[2:1])))
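A plain base R alternative (not from the answers above) is to count each letter directly with strsplit and vapply; the initial data frame from the question is re-created here so the snippet is self-contained:

```r
# the question's sample data
initial <- data.frame(Strings = rep(c("A,A,B,C", "A,B,C", "A,A,A,A,A,B"), 100))
parts <- strsplit(as.character(initial$Strings), ",")
# one count column per distinct letter
for (ch in c("A", "B", "C")) {
  initial[[ch]] <- vapply(parts, function(x) sum(x == ch), integer(1))
}
```

This avoids building the intermediate stacked data.frame, at the cost of looping once per distinct letter.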
