Remove rows whose cells do not match the column class - r

So, similarly to removing NA values, I need to remove rows whose cell values do not match the column class. For this example, I want to remove the rows for Andy, Aaron and Dorothy: Andy's Gender is 12, but it should only be "Male" or "Female"; Aaron's Status is NA, so I would like to remove that too; and lastly, Dorothy's Age is "abc" instead of a numeric.
Name Age Gender Status
Tom 12 Male Married
Dom 41 Male Single
Kelvin 23 Male Married
Tim 12 Male Single
Andy 42 12 Single
Aaron 12 Male NA
Dorothy abc Female Married
Nathan 34 Male Single

Each column should have a class assigned to it; in this case there wasn't one. The solution provided by Adam Quek was helpful!
For columns that should be numeric, e.g. dat$Age <- as.numeric(as.character(dat$Age))
For columns that should be factors, e.g. dat$Gender <- factor(dat$Gender, levels=c("Male", "Female"))
The lines above turn any abnormal values into NA, and finally na.exclude(dat) does the work.
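Putting those steps together, a minimal runnable sketch (the data frame below is rebuilt from the example table above):

```r
# Rebuild the example data; every column starts out as character
dat <- data.frame(
  Name   = c("Tom", "Dom", "Kelvin", "Tim", "Andy", "Aaron", "Dorothy", "Nathan"),
  Age    = c("12", "41", "23", "12", "42", "12", "abc", "34"),
  Gender = c("Male", "Male", "Male", "Male", "12", "Male", "Female", "Male"),
  Status = c("Married", "Single", "Married", "Single", "Single", NA, "Married", "Single"),
  stringsAsFactors = FALSE
)

# Coerce each column to its intended class; invalid entries become NA
dat$Age    <- suppressWarnings(as.numeric(as.character(dat$Age)))  # "abc" -> NA
dat$Gender <- factor(dat$Gender, levels = c("Male", "Female"))     # "12"  -> NA

# Drop every row containing at least one NA
clean <- na.exclude(dat)
```

This leaves the five valid rows (Tom, Dom, Kelvin, Tim, Nathan).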

Related

Create weighted node and edge lists from a normal dataframe in R?

I'm trying to use visNetwork to create a node diagram. However, my data is not in the correct format and I haven't been able to find any help on this on the internet.
My current data frame looks similar to this:
name town car color age school
John Bringham Swift Red 22 Brighton
Sarah Bringham Corolla Red 33 Rustal
Beth Burb Swift Blue 43 Brighton
Joe Spring Polo Black 18 Riding
I want to use this data to create the node and edge lists needed for a vis network.
I know that the "nodes" list will be made from the unique values in the "name" column, but I'm not sure how I would use the rest of the data to create the "edges" list.
I was thinking it may be possible to group by each column and then read back the matches, but I'm not sure how to implement this. The idea is to weight the edges by how many matches are detected across the various group-bys.
For example, Joe will not match with anyone because he shares no common values with any of the others, while John and Sarah will have a weight of 2 because they share two common values (town and color).
Also open to solutions in python!
One option is to compare row by row, in order to calculate the number of common values.
For instance for John (first row) and Sarah (second row):
sum(df[1,] == df[2,])
# 2
Then you use the function combn() from the utils package to enumerate in advance the pairs you have to compare:
nodes <- as.data.frame(matrix(combn(df$name, 2), ncol = 2, byrow = TRUE))
nodes$V1 <- as.character(nodes$V1)
nodes$V2 <- as.character(nodes$V2)
nodes$weight <- NA
(nodes)
# V1 V2 weight
#1 John Sarah NA
#2 John Beth NA
#3 John Joe NA
#4 Sarah Beth NA
#5 Sarah Joe NA
#6 Beth Joe NA
Finally, a loop calculates the weight of each pair:
for(n in 1:nrow(nodes)){
name1 <- df[df$name == nodes$V1[n],]
name2 <- df[df$name == nodes$V2[n],]
nodes$weight[n] <- sum(name1 == name2)
}
# V1 V2 weight
#1 John Sarah 2
#2 John Beth 2
#3 John Joe 0
#4 Sarah Beth 0
#5 Sarah Joe 0
#6 Beth Joe 0
I think nodes will be the kind of data frame that you can use with the function visNetwork().
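As a hypothetical follow-up (not part of the answer above): visNetwork() expects a nodes data frame with an id column and an edges data frame with from/to columns, where a value column controls edge width, so the pairwise table can be finished like this:

```r
# Rebuild the example data from the question
df <- data.frame(
  name   = c("John", "Sarah", "Beth", "Joe"),
  town   = c("Bringham", "Bringham", "Burb", "Spring"),
  car    = c("Swift", "Corolla", "Swift", "Polo"),
  color  = c("Red", "Red", "Blue", "Black"),
  age    = c(22, 33, 43, 18),
  school = c("Brighton", "Rustal", "Brighton", "Riding"),
  stringsAsFactors = FALSE
)

# All unordered pairs of names, with visNetwork's expected column names
pairs <- as.data.frame(matrix(combn(df$name, 2), ncol = 2, byrow = TRUE),
                       stringsAsFactors = FALSE)
names(pairs) <- c("from", "to")

# Weight = number of columns (other than name) the two people share
pairs$value <- apply(pairs, 1, function(p) {
  sum(df[df$name == p["from"], -1] == df[df$name == p["to"], -1])
})

# Keep only pairs that actually share something
edges <- pairs[pairs$value > 0, ]
vis_nodes <- data.frame(id = df$name, label = df$name)

# library(visNetwork)          # not run here
# visNetwork(vis_nodes, edges)
```

With the example data, only John-Sarah and John-Beth survive the filter, each with weight 2.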

Show proportion with multiple conditions in R

I have:
> dataframe
GENDER CITY NUMBER
Male NY 1
Female Paris 2
Male Paris 1
Female NY
Female NY 2
Male Paris 2
Male Paris
Male Paris 1
Female NY 2
Female Paris 1
And I would like to return the proportion of Male and Female, per city (starting with NY), that have 2 in the NUMBER column (the real data frame is much longer than my example), knowing that there are empty rows in the NUMBER column.
Technically speaking, I want a proportion under two conditions (and more conditions in the future).
I tried:
prop.table(table(dataframe$GENDER, dataframe$CITY == 'NY' & dataframe$NUMBER == 2))
But this gives me the wrong results.
The expected output (or anything close to this):
NY
Male 0
Female 20
Do you have any idea how I can get this?
The best would be to have a column per city
Use the data.table package; it makes your life much easier. It has a concise, SQL-like syntax and is very fast as your data grows. The code would be:
library(data.table)
df <- data.table(yourdataframe)
df[, summary(GENDER), by = CITY]
The output gives a per-city summary of GENDER (a count of each value when GENDER is a factor).
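For the proportion the question actually asks about (the share of each gender, within each city, whose NUMBER is 2), here is a base-R sketch; treating the blank NUMBER entries as NA is an assumption:

```r
# Example data; blank NUMBER entries are assumed to load as NA
df <- data.frame(
  GENDER = c("Male", "Female", "Male", "Female", "Female",
             "Male", "Male", "Male", "Female", "Female"),
  CITY   = c("NY", "Paris", "Paris", "NY", "NY",
             "Paris", "Paris", "Paris", "NY", "Paris"),
  NUMBER = c(1, 2, 1, NA, 2, 2, NA, 1, 2, 1)
)

# Within each gender/city group: fraction of rows where NUMBER == 2,
# ignoring the missing NUMBER entries
prop <- with(df, tapply(NUMBER == 2, list(GENDER, CITY),
                        function(x) mean(x, na.rm = TRUE)))
```

prop is a gender-by-city matrix, i.e. one column per city as requested, and more conditions can be added inside the logical expression.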

R - How can I find a duplicated line based in one Column and add extra text in that duplicated value?

I'm looking for an easy solution, instead of doing several steps.
I have a data frame with 36 variables and almost 3,000 rows; one of the vars is a character column with names. They must be unique. I need to find the rows with the same name and then add "duplicated" to the text. I can't delete the duplicates because the data comes from a relational database and I'll need those row IDs for other operations.
I can find the duplicated rows and then rename the text manually, but that means finding the duplicates, recording the row IDs and then replacing the text by hand.
Is there a way to automatically add the extra text to the duplicated names? I'm still new to R and have a hard time writing condition-based functions.
It would be something like this:
From this:
ID name age sex
1 John 18 M
2 Mary 25 F
3 Mary 19 F
4 Ben 21 M
5 July 35 F
To this:
ID name age sex
1 John 18 M
2 Mary 25 F
3 Mary - duplicated 19 F
4 Ben 21 M
5 July 35 F
Could you guys shed some light?
Thank you very much.
Edit: the comment about adding a column is probably the best thing to do, but if you really want to do what you're suggesting...
The duplicated() function will identify the duplicates. Then you just need paste0() to append the text.
df <- data.frame(
ID = 1:5,
name = c('John', 'Mary', 'Mary', 'Ben', 'July'),
age = c(18, 25, 19, 21, 35),
sex = c('M', 'F', 'F', 'M', 'F'),
stringsAsFactors = FALSE)
# Add "-duplicated" to every duplicated value (following Laterow's comment)
dup <- duplicated(df$name)
df$name[dup] <- paste0(df$name[dup], '-duplicated')
df
ID name age sex
1 1 John 18 M
2 2 Mary 25 F
3 3 Mary-duplicated 19 F
4 4 Ben 21 M
5 5 July 35 F
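One caveat not covered in the answer: if a name occurs three or more times, pasting the same "-duplicated" suffix onto every repeat creates new collisions. Base R's make.unique() avoids this by numbering each repeat (the six-row data frame below extends the example with a third Mary):

```r
df <- data.frame(
  ID   = 1:6,
  name = c("John", "Mary", "Mary", "Ben", "July", "Mary"),
  stringsAsFactors = FALSE
)

# First occurrence keeps its name; repeats get " - duplicated 1", " - duplicated 2", ...
df$name <- make.unique(df$name, sep = " - duplicated ")
```

After this, every value in df$name is guaranteed unique.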

readr::read_csv(), empty strings as NA not working

I was trying to load a CSV file with readr::read_csv() in which some entries are blank. I set na = "" in read_csv(), but it still loads them as blank entries.
d1 <- read_csv("sample.csv",na="") # want to load empty string as NA
where sample.csv looks like the following:
Name,Age,Weight,City
Sam,13,30,
John,35,58,CA
Doe,20,50,IL
Ann,18,45,
d1 should show the following (using read_csv()):
Name Age Weight City
1 Sam 13 30 NA
2 John 35 58 CA
3 Doe 20 50 IL
4 Ann 18 45 NA
The first and fourth rows of City should be NA (as shown above), but actually they show up blank.
Based on the comments and verifying myself, the solution was to upgrade to readr_0.2.2.
Thanks to fg nu, akrun and Richard Scriven
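As a side note, on versions where the readr option misbehaves, base R's read.csv() gives the same result via na.strings (a sketch; the temp-file setup just recreates sample.csv for the example):

```r
# Recreate sample.csv in a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age,Weight,City",
             "Sam,13,30,",
             "John,35,58,CA",
             "Doe,20,50,IL",
             "Ann,18,45,"), path)

# Treat empty strings (and the literal "NA") as missing
d1 <- read.csv(path, na.strings = c("", "NA"), stringsAsFactors = FALSE)
```

The City column for Sam and Ann now comes back as NA rather than "".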

Locate and merge duplicate rows in a data.frame but ignore column order

I have a data.frame with 1,000 rows and 3 columns. It contains a large number of duplicates and I've used plyr to combine the duplicate rows and add a count for each combination as explained in this thread.
Here's an example of what I have now (I still also have the original data.frame with all of the duplicates if I need to start from there):
name1 name2 name3 total
1 Bob Fred Sam 30
2 Bob Joe Frank 20
3 Frank Sam Tom 25
4 Sam Tom Frank 10
5 Fred Bob Sam 15
However, column order doesn't matter. I just want to know how many rows have the same three entries, in any order. How can I combine the rows that contain the same entries, ignoring order? In this example I would want to combine rows 1 and 5, and rows 3 and 4.
Define another column that's a "sorted paste" of the names, which would have the same value of "Bob~Fred~Sam" for rows 1 and 5. Then aggregate based on that.
Brief code snippet (this assumes the original data frame is dd); it's all really intuitive. We create a lookup column (take a look at it and it should be self-explanatory), get the sums of the total column for each combination, and then filter down to the unique combinations...
dd$lookup=apply(dd[,c("name1","name2","name3")],1,
function(x){paste(sort(x),collapse="~")})
tab1=tapply(dd$total,dd$lookup,sum)
ee=dd[match(unique(dd$lookup),dd$lookup),]
ee$newtotal=as.numeric(tab1)[match(ee$lookup,names(tab1))]
You now have in ee a set of unique rows and their corresponding total counts. Easy - and no external packages needed. And crucially, you can see at every stage of the process what is going on!
(Minor update to help OP:) And if you want a cleaned-up version of the final answer:
outdf = with(ee,data.frame(name1,name2,name3,
total=newtotal,stringsAsFactors=FALSE))
This gives you a neat data frame with the three all-important name columns, and with the aggregated totals in a column called total rather than newtotal.
Sort the index columns, then use ddply to aggregate and sum:
Define the data:
dat <- " name1 name2 name3 total
1 Bob Fred Sam 30
2 Bob Joe Frank 20
3 Frank Sam Tom 25
4 Sam Tom Frank 10
5 Fred Bob Sam 15"
x <- read.table(text=dat, header=TRUE)
Create a copy:
xx <- x
Use apply to sort the columns, then aggregate:
xx[, -4] <- t(apply(xx[, -4], 1, sort))
library(plyr)
ddply(xx, .(name1, name2, name3), numcolwise(sum))
name1 name2 name3 total
1 Bob Frank Joe 20
2 Bob Fred Sam 45
3 Frank Sam Tom 35
