Delete rows with value frequencies lower than x in R

I have a data frame in R like the following:
V1 V2 V3
1 2 3
1 43 54
2 34 53
3 34 51
3 43 42
...
And I want to delete all rows whose V1 value has a frequency lower than 2. So in my example the row with V1 = 2 should be deleted, because the value "2" only appears once in the column ("1" and "3" appear twice each).
I tried to add an extra column with the frequency of V1 so I could keep only the rows where the frequency is > 1, but with the following I only get NAs in the extra column.
data$Frequency <- table(data$V1)[data$V1]
Thanks

You can try this:
library(dplyr)
df %>% group_by(V1) %>% filter(n() > 1)
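The NAs in the question's attempt come from indexing the table with numeric values, which R treats as positions rather than names. A minimal base-R sketch of the fix, using made-up data matching the example (the name data is assumed from the question):

```r
data <- data.frame(V1 = c(1, 1, 2, 3, 3),
                   V2 = c(2, 43, 34, 34, 43),
                   V3 = c(3, 54, 53, 51, 42))

# Index the frequency table by name, not by position
data$Frequency <- as.vector(table(data$V1)[as.character(data$V1)])

# Keep only rows whose V1 value appears at least twice
data[data$Frequency > 1, ]
```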

You can also consider using data.table. We first count the occurrence of each value in V1, then keep the rows where that count is more than 1. Finally, we remove the count column as we no longer need it.
library(data.table)
setDT(dat)
dat2 <- dat[, n := .N, by = V1][n > 1][, n := NULL]
Or, following RichardScriven's suggestion, compute the qualifying row indices with .I and subset with them:
dat2 <- dat[dat[, .(idx = .I[.N >= 2]), by = V1]$idx]
> dat2
V1 V2 V3
1: 1 2 3
2: 1 43 54
3: 3 34 51
4: 3 43 42

With this approach you do not need to load a library:
res <- data.frame(V1 = c(1, 1, 2, 3, 3, 3), V2 = rnorm(6), V3 = rnorm(6))
res[res$V1 %in% names(table(res$V1))[table(res$V1) >= 2], ]


How to change a value in a cell along the same row in r programming?
I have a text-delimited file and narrowed the columns down to V1 and v2.
V1 V2
123 23
133 44
222 55
data2 <- data[-c(-1,-2)]
Now I want to change the value in the second column cell for a specific number let's say 133.
Basically for 133 in column v1 I want the cell adjacent to it to be changed to 999. How do I do that?
I have tried:
data4 <- data2[data2$v1==133,V2] <-999
But no luck with it.
In your code, V1 was written in lowercase (v1), and R is case sensitive. Also, V2 wasn't quoted inside the brackets.
Code
data <- data.frame(V1 = c(123,133,222), V2 = c(23,44,55))
data[data$V1==133,"V2"] <- 999
data
Output
V1 V2
1 123 23
2 133 999
3 222 55
We could use ifelse:
library(dplyr)
df %>%
  mutate(V2 = ifelse(V1 == 133, 999, V2))
V1 V2
1 123 23
2 133 999
3 222 55

In R, how can I filter only those rows in which the value for column V6 appears exactly 2 times?

How can I filter in R only those rows in which the value for column V6 appears exactly 2 times?
My dataset is `date`:
I tried:
library(dplyr)
df <- as.data.frame(date)
df1 <- subset(df,duplicated(V6))
but it does not work.
You can use a contingency table to get the value counts. Here's some example code.
# Make some dummy data (only 8 and 2 appear exactly twice in this example)
df <- data.frame(V1 = 1:10,
                 V2 = 11:10,
                 V6 = c(1, 2, 8, 3, 4, 3, 2, 3, 8, 7))
# Get table of counts for column "V6"
tab <- table(df$V6)
# Get values that appear exactly twice
twice <- as.numeric(names(tab)[tab == 2])
# Filter the data frame based on these values
df <- df[df$V6 %in% twice,]
Output:
V1 V2 V6
2 2 10 2
3 3 11 8
7 7 11 2
9 9 11 8
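A dplyr alternative to the contingency-table approach, grouping on V6 and keeping only groups of size exactly 2 (a sketch using the same dummy data, since the question's actual dataset isn't shown):

```r
library(dplyr)

df <- data.frame(V1 = 1:10,
                 V6 = c(1, 2, 8, 3, 4, 3, 2, 3, 8, 7))

# Keep only rows whose V6 value occurs exactly twice
df %>%
  group_by(V6) %>%
  filter(n() == 2) %>%
  ungroup()
```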

Conditional count across each row using R

I have tried every solution I could find, but my problem is still there. I have a big df (20 rows × 400 cols); for each row I want to count how many columns have a value of more than 16.
The first col is factor and the rest of the columns are integers.
my df:
col1 col2 col3 col4
abc 2 16 17
def 4 2 4
geh 50 60 73
desired output should be:
col1 col2 col3 col4 count
abc 2 16 17 1
def 4 2 4 0
geh 50 60 73 3
I tried df$morethan16 <- rowSums(df[,-1] > 16) but then I get NA in the count column.
We may need na.rm = TRUE to take care of NA elements, as comparisons like >, <, or == return NA wherever there are NA elements:
df$morethan16 <- rowSums(df[,-1] > 16, na.rm = TRUE)
If we still get NA, check the class of the columns. The above code works only if the columns are numeric. Convert them to numeric class automatically with type.convert (which infers each column's class from its values):
df <- type.convert(df, as.is = TRUE)
Check the structure:
str(df)
If it is still not numeric, some values in the column may be character elements that prevent conversion to numeric. Force the columns to numeric with as.numeric. If they are factor columns, apply as.character first:
df[-1] <- lapply(df[-1], function(x) as.numeric(as.character(x)))
Here is another option using crossprod:
df$count <- c(crossprod(rep(1, ncol(df[-1])), t(df[-1] > 16)))
which gives
col1 col2 col3 col4 count
1 abc 2 16 17 1
2 def 4 2 4 0
3 geh 50 60 73 3
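For completeness, an apply-based sketch of the same per-row count, built on dummy data matching the question's example (column names assumed from the post):

```r
df <- data.frame(col1 = c("abc", "def", "geh"),
                 col2 = c(2, 4, 50),
                 col3 = c(16, 2, 60),
                 col4 = c(17, 4, 73))

# For each row, count how many of the numeric columns exceed 16
df$count <- apply(df[-1], 1, function(x) sum(x > 16, na.rm = TRUE))
df
```

Note that rowSums, as in the accepted approach, is usually faster than apply on wide data frames; this version just makes the row-wise logic explicit.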

Creating a new variable in R from two existing ones

My apologies if this is a basic question. I'm new to R.
I have a dataset, DAT, which has 3 variables: ID, V1 and V2. Unfortunately, V2 data are missing for many cases. I want to create a new variable, V3. I want V3 to have the same values as V2, but for any case that has a missing value for V2, I want V3 to take the value of V1 instead. What is the most efficient way to do this in R?
One approach using the dplyr package.
# Step 1: Load verb-like data wrangling package.
library(dplyr)
# Step 2: Create some data.
df <- data.frame(ID=1:5, V1 = 11:15, V2 = c(31:33, NA, NA))
ID V1 V2
1 11 31
2 12 32
3 13 33
4 14 NA
5 15 NA
# Step 3: Create a variable V3 using your criteria
df <- mutate(df, V3 = if_else(is.na(V2), V1, V2))
ID V1 V2 V3
1 11 31 31
2 12 32 32
3 13 33 33
4 14 NA 14
5 15 NA 15
Using the data.table package would probably be more efficient if you have a big data frame.
You can also use the ifelse statement.
DAT$V3 <- ifelse(is.na(DAT$V2), DAT$V1, DAT$V2)
Reads as: if V2 is missing (NA), then use V1; otherwise use the value in V2.
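dplyr's coalesce() expresses "first non-missing value" directly, which reads well for this exact pattern (a sketch, with DAT built to match the question's description):

```r
library(dplyr)

DAT <- data.frame(ID = 1:5, V1 = 11:15, V2 = c(31:33, NA, NA))

# V3 takes V2 where available, falling back to V1 where V2 is NA
DAT$V3 <- coalesce(DAT$V2, DAT$V1)
DAT
```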

Moving rows from one dataframe to another based on a matching column

My apologies for asking this question, because I saw something similar in the past but couldn't find it (so flagging this as a duplicate would be understandable).
I have 2 data frames, and I want to move all (matching) customers who appear in both data frames into one of them. Please note that I want to move the entire row.
Here is an example:
# df1
customer_ip V1 V2
1 15 20
2 12 18
# df2
customer_ip V1 V2
2 45 50
3 12 18
And I want my new data frames to look like:
# df1
customer_ip V1 V2
1 15 20
2 12 18
2 45 50
# df2
customer_ip V1 V2
3 12 18
Thank you in advance!
This does it.
df1 <- rbind(df1, df2[df2$customer_ip %in% df1$customer_ip, ])
df2 <- df2[!(df2$customer_ip %in% df1$customer_ip), ]
EDIT: Gaurav & Sotos got here before me whilst I was writing with essentially the same answer, but I'll leave this here as it shows the code without the redundant 'which'
This should do the trick:
#Add appropriate rows to df1
df1 <- rbind(df1, df2[which(df2$customer_ip %in% df1$customer_ip),])
#Remove appropriate rows from df2 (caution: -which() drops every row when there
#are no matches, since df2[integer(0), ] is empty; the %in% negation above avoids this)
df2 <- df2[-which(df2$customer_ip %in% df1$customer_ip), ]
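A dplyr sketch of the same move using semi_join/anti_join, which sidesteps the -which() edge case entirely (df1 and df2 as in the question):

```r
library(dplyr)

df1 <- data.frame(customer_ip = c(1, 2), V1 = c(15, 12), V2 = c(20, 18))
df2 <- data.frame(customer_ip = c(2, 3), V1 = c(45, 12), V2 = c(50, 18))

# Rows of df2 whose customer also appears in df1
matched <- semi_join(df2, df1, by = "customer_ip")

df1 <- bind_rows(df1, matched)                 # move them into df1
df2 <- anti_join(df2, df1, by = "customer_ip") # drop them from df2
```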
