Let's say I have this data frame. How would I go about removing only the NA values associated with name a, without removing them manually?
name ID1 ID2
a 1 4
a 7 3
a NA 4
a 6 3
a NA 4
a NA 3
a 2 4
a NA 3
a 1 4
b NA 2
c 3 NA
I've tried using !is.na(), but that removes every row with an NA in the ID1 column, for all the names. How would I specifically target the ones associated with name a?
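For reference, the attempt was presumably something along these lines (my assumption), which drops every row where ID1 is NA, regardless of name:
df[!is.na(df$ID1), ]
So the NA test needs to be combined with a test on name.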
You could subset your data frame as follows, dropping a row only when both conditions hold (name is "a" and ID1 is NA):
df_new <- df[!(df$name == "a" & is.na(df$ID1)), ]
By De Morgan's law, the same condition can also be written as:
df_new <- df[df$name != "a" | !is.na(df$ID1), ]
With dplyr:
library(dplyr)
df %>%
  filter(!(name == "a" & is.na(ID1)))
Or with subset:
subset(df, !(name == "a" & is.na(ID1)))
Output
name ID1 ID2
1 a 1 4
2 a 7 3
3 a 6 3
4 a 2 4
5 a 1 4
6 b NA 2
7 c 3 NA
Data
df <- structure(list(name = c("a", "a", "a", "a", "a", "a", "a", "a",
"a", "b", "c"), ID1 = c(1L, 7L, NA, 6L, NA, NA, 2L, NA, 1L, NA,
3L), ID2 = c(4L, 3L, 4L, 3L, 4L, 3L, 4L, 3L, 4L, 2L, NA)), class = "data.frame", row.names = c(NA,
-11L))
Related
I've been trying to summarize data by multiple groups, where the new column should be the proportion of one column to another within those groups. Because these two columns never both contain a value in the same row, the proportion cannot be calculated row by row.
Below is an example.
By P_Common and Number_7 groups, I'd like the total N_count / A_count.
structure(list(P_Common = c("B", "B", "C", "C", "D", "E", "E",
"F", "G", "G", "B", "G", "E", "D", "F", "C"), Number_7 = c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 3L, 1L, 2L, 1L, 1L),
N_count = c(0L, 4L, 22L, NA, 7L, 0L, 44L, 16L, NA, NA, NA,
NA, NA, NA, NA, NA), A_count = c(NA, NA, NA, NA, NA, NA,
NA, NA, 0L, 4L, 7L, NA, 23L, 4L, 7L, 17L)), class = "data.frame", row.names = c(NA,
-16L))
P_Common Number_7 N_count A_count
B 1 0 NA
B 1 4 NA
C 1 22 NA
C 1 NA NA
D 2 7 NA
E 2 0 NA
E 2 44 NA
F 2 16 NA
G 3 NA 0
G 3 NA 4
B 1 NA 7
G 3 NA NA
E 1 NA 23
D 2 NA 4
F 1 NA 7
C 1 NA 17
In this example there would be quite a few 0 and NA values, but that's okay, they can stay in. Overall it would become something like:
P_Common Number_7 Propo
B 1 0.571428571
C 1 1.294117647
D 2 1.75
... etc
You can do:
df %>%
  group_by(P_Common, Number_7) %>%
  summarise(Propo = sum(N_count, na.rm = TRUE) / sum(A_count, na.rm = TRUE))
P_Common Number_7 Propo
<chr> <int> <dbl>
1 B 1 0.571
2 C 1 1.29
3 D 2 1.75
4 E 1 0
5 E 2 Inf
6 F 1 0
7 F 2 Inf
8 G 3 0
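The rows where the A_count sum is 0 come out as Inf (and a 0 sum over 0 would be NaN). If those are unwanted, they could be dropped afterwards. A small addition on top of the answer (assuming dplyr >= 1.0 for the .groups argument):
df %>%
  group_by(P_Common, Number_7) %>%
  summarise(Propo = sum(N_count, na.rm = TRUE) / sum(A_count, na.rm = TRUE),
            .groups = "drop") %>%
  filter(is.finite(Propo))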
I would like to find a way to do something very similar to this question:
Increment by 1 for every change in column
But I want to restart the counter when var1 == "c".
Using
df$var2 <- with(rle(as.character(df$var1)), rep(seq_along(values), lengths))
results in column var2:
var1 var2 Should be
a 1 1
a 1 1
1 2 2
0 3 3
b 4 4
b 4 4
b 4 4
c 5 1
1 6 2
1 6 2
In data.table you can use rleid to get a run-length id for var1 within each group, where a new group starts at every "c".
library(data.table)
setDT(df)
df[, var2 := rleid(var1), by = cumsum(var1 == "c")]
df
# var1 var2
# 1: a 1
# 2: a 1
# 3: 1 2
# 4: 0 3
# 5: b 4
# 6: b 4
# 7: b 4
# 8: c 1
# 9: 1 2
#10: 1 2
And using dplyr:
library(dplyr)
df %>%
  group_by(group = cumsum(var1 == "c")) %>%
  mutate(var2 = cumsum(var1 != lag(var1, default = first(var1))) + 1)
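The result keeps the helper group column and stays grouped. If that is unwanted, it can be cleaned up afterwards (a small addition, not part of the original answer):
df %>%
  group_by(group = cumsum(var1 == "c")) %>%
  mutate(var2 = cumsum(var1 != lag(var1, default = first(var1))) + 1) %>%
  ungroup() %>%
  select(-group)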
data
df <- structure(list(var1 = structure(c(3L, 3L, 2L, 1L, 4L, 4L, 4L,
5L, 2L, 2L), .Label = c("0", "1", "a", "b", "c"), class = "factor")),
class = "data.frame", row.names = c(NA, -10L))
We can use the OP's rle code in base R together with ave, applying it within each chunk that starts at "c":
df$var2 <- with(df, as.integer(ave(as.character(var1), cumsum(var1 == 'c'),
FUN = function(x) with(rle(x), rep(seq_along(values), lengths)))))
df$var2
#[1] 1 1 2 3 4 4 4 1 2 2
data
df <- structure(list(var1 = structure(c(3L, 3L, 2L, 1L, 4L, 4L, 4L,
5L, 2L, 2L), .Label = c("0", "1", "a", "b", "c"), class = "factor")),
class = "data.frame", row.names = c(NA,
-10L))
I have a data frame of all cities in Brazil, but I want only some predefined cities, which are listed in a separate column. I'd like to keep all the columns of my data frame, but select only the rows where the all-cities column matches the predefined-cities column.
data = read.csv(file="C:/Users/guilherme/Desktop/data.csv", header=TRUE, sep=";")
> data
AllCities Year1990 Year200 PredefinedCities CharacCities1 CharacCities2
1 A 2 4 C 12 5
2 B 2 2 A 11 10
3 C 3 4 F 09 2
4 D 4 2
5 E 5 6
6 F 6 2
I want the following:
> data
AllCities Year1990 Year200 PredefinedCities CharacCities1 CharacCities2
1 C 3 4 C 12 5
2 A 2 4 A 11 10
3 F 6 2 F 09 2
You need merge -
merge(
data[, c("AllCities", "Year1990", "Year200")],
data[, c("PredefinedCities", "CharacCities1", "CharacCities2")],
by.x = "AllCities", by.y = "PredefinedCities"
)
AllCities Year1990 Year200 CharacCities1 CharacCities2
1 A 2 4 11 10
2 C 3 4 12 5
3 F 6 2 9 2
Note - Your data format is unusual. If you can, you should fix the data source so that it gives you the AllCities and PredefinedCities tables separately, or maybe even join them correctly before creating the csv file.
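The same join can also be written with dplyr, as an alternative sketch (not part of the answer above):
library(dplyr)
inner_join(
  data[, c("AllCities", "Year1990", "Year200")],
  data[, c("PredefinedCities", "CharacCities1", "CharacCities2")],
  by = c("AllCities" = "PredefinedCities")
)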
Data -
structure(list(AllCities = c("A", "B", "C", "D", "E", "F"), Year1990 = c(2L,
2L, 3L, 4L, 5L, 6L), Year200 = c(4L, 2L, 4L, 2L, 6L, 2L), PredefinedCities = c("C",
"A", "F", "", "", ""), CharacCities1 = c(12L, 11L, 9L, NA, NA,
NA), CharacCities2 = c(5L, 10L, 2L, NA, NA, NA)), .Names = c("AllCities",
"Year1990", "Year200", "PredefinedCities", "CharacCities1", "CharacCities2"
), class = "data.frame", row.names = c(NA, -6L))
Alternatively, if you just want to filter the rows of the original frame, keeping only the cities that appear in PredefinedCities (without realigning the CharacCities columns against the matched city):
data <- data[data$AllCities %in% data$PredefinedCities, ]
Greetings,
I need to prepare data for network analysis in Gephi. I have data in the following format:
[image: my data]
And I need it in the following format, where the values represent persons that are connected through an organization:
[image: required format]
Thank you very much!
I think this code should do the job. It is not the most elegant way of doing it, but it works :)
# Data
x <-
structure(
list(
Persons = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
Organizations = c("A", "B", "E", "F", "A", "E", "C", "D", "C", "A", "E")
),
.Names = c("Persons", "Organizations"),
class = "data.frame",
row.names = c(NA, -11L)
)
# Self-merge on Organizations to pair up persons who share an organization (n:n),
# keeping only the two person columns
edgelist <- merge(x, x, by = "Organizations")[, 2:3]
# We don't want self-links (a person paired with themselves)
edgelist <- subset(edgelist, Persons.x != Persons.y)
# Remove repeated pairs
edgelist <- unique(edgelist)
edgelist
#> Persons.x Persons.y
#> 2 1 3
#> 3 1 2
#> 4 3 1
#> 6 3 2
#> 7 2 1
#> 8 2 3
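To write this out for Gephi's edge-list import (a sketch; the file name is only an example), the columns can be renamed to Source/Target and saved as CSV:
names(edgelist) <- c("Source", "Target")
write.csv(edgelist, "edgelist.csv", row.names = FALSE)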
Hope it helps!
Created on 2018-01-03 by the reprex package (v0.1.1.9000).
Starting with x:
structure(list(
  Persons = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
  Organizations = c("A", "B", "E", "F", "A", "E", "C", "D", "C", "A", "E")
), .Names = c("Persons", "Organizations"),
class = "data.frame", row.names = c(NA, -11L))
Create a new data.frame with the column names you need. Just convert Organizations to a factor and use its integer codes as node ids:
> y=data.frame(Source=x$Persons, Target=as.numeric(as.factor(x$Organizations)))
> y
Source Target
1 1 1
2 1 2
3 1 5
4 2 6
5 2 1
6 2 5
7 2 3
8 3 4
9 3 3
10 3 1
11 3 5
For what it's worth, I'm pretty sure Gephi can handle strings.
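If so, the factor-to-integer step could be skipped and the organization names used directly as node ids (a sketch, untested against any particular Gephi version):
y <- data.frame(Source = x$Persons, Target = x$Organizations)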
I'm trying to remove all the rows that have the same value in the "lan" column of my data frame but a different value in my "id" column (but not vice versa).
Using an example dataset:
require(dplyr)
t <- structure(list(id = c(1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L), lan = structure(c(1L, 2L, 3L, 4L, 4L, 5L, 5L, 5L, 6L, 1L,
7L), .Label = c("a", "b", "c", "d", "e", "f", "g"), class = "factor"),
value = c(0.22988498, 0.848989831, 0.538065821, 0.916571913,
0.304183372, 0.983348167, 0.356128559, 0.054102854, 0.400934593,
0.001026817, 0.488452667)), .Names = c("id", "lan", "value"
), class = "data.frame", row.names = c(NA, -11L))
t
I need to get rid of rows 1 and 10 because they have the same lan (a) but different id.
I've tried the following, without success:
a<-t[(!duplicated(t$id)),]
c<-a[duplicated(a$lan)|duplicated(a$lan, fromLast=TRUE),]
d<-t[!(t$lan %in% c$lan),]
Thanks for your help!
And an alternative using dplyr:
t2 <- t %>%
  group_by(lan, id) %>%
  summarise(value = sum(value)) %>%
  group_by(lan) %>%
  summarise(number = n()) %>%
  filter(number > 1) %>%
  select(lan)
> t[!t$lan %in% t2$lan ,]
id lan value
2 2 b 0.84898983
3 2 c 0.53806582
4 3 d 0.91657191
5 3 d 0.30418337
6 4 e 0.98334817
7 4 e 0.35612856
8 4 e 0.05410285
9 4 f 0.40093459
11 4 g 0.48845267
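A more compact dplyr variant (a sketch along the same lines, not part of the answer above) keeps only the lan groups that contain a single distinct id:
t %>%
  group_by(lan) %>%
  filter(n_distinct(id) == 1) %>%
  ungroup()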
You could use duplicated() on "lan" to get a logical index of the rows whose lan value is duplicated, and do the same on both columns ('id', 'lan') together to flag the rows whose id/lan pair is not duplicated. A row that is TRUE in both cases has a lan shared with a different id; negate that combined index and subset.
indx1 <- with(t, duplicated(lan)|duplicated(lan,fromLast=TRUE))
indx2 <- !(duplicated(t[1:2])|duplicated(t[1:2],fromLast=TRUE))
t[!(indx1 & indx2),]
# id lan value
#2 2 b 0.84898983
#3 2 c 0.53806582
#4 3 d 0.91657191
#5 3 d 0.30418337
#6 4 e 0.98334817
#7 4 e 0.35612856
#8 4 e 0.05410285
#9 4 f 0.40093459
#11 4 g 0.48845267
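As a check, for the example data the combined index flags exactly rows 1 and 10, the two "a" rows with different ids:
indx1 & indx2
# [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE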