How to remove the duplicate data from csv file?


I have data about baseball results from 2016.
Now I want to remove the rows for games that ended in a tie.
That is, I want to remove the rows that have the same value in $team1_score and $team2_score.
Which function can I use for this in R?
I tried the following code, but it didn't work:
Baseball2 <- Baseball[!duplicated(Baseball$team1_score)]
Please help me...!!

Here's a simple way to remove rows with tied scores:
(dat <- data.frame(Team1_Score= c(1,2,3), Team2_Score=c(2,3,3)))
  Team1_Score Team2_Score
1           1           2
2           2           3
3           3           3
Use a logical test to find which rows have a tied score:
tie <- dat$Team1_Score == dat$Team2_Score
tie
[1] FALSE FALSE TRUE
Use this result to select the rows that are not ties:
dat[!tie, ]
  Team1_Score Team2_Score
1           1           2
2           2           3
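To connect this back to the question, here is a minimal sketch of why duplicated() does not find ties. The data frame below is a made-up stand-in for the asker's data; only the column names team1_score and team2_score come from the question.

```r
# Hypothetical stand-in for the asker's data; the column names are taken
# from the question, the values are made up.
Baseball <- data.frame(team1_score = c(1, 2, 3, 2),
                       team2_score = c(2, 3, 3, 5))

# duplicated() flags repeated values within ONE column, not ties between two.
dup <- duplicated(Baseball$team1_score)

# Comparing the two columns row by row finds the ties instead.
tie <- Baseball$team1_score == Baseball$team2_score

# Keep the rows that are not ties; the comma makes the index select rows.
Baseball2 <- Baseball[!tie, ]
Baseball2
```

Note also that the original attempt has no comma inside the brackets, so the logical vector would be used to select columns rather than rows.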

I understand you do not want to remove duplicates, but need to subset the data frame, discarding tied matches.
A very simple option using data.table:
library(data.table)
# Convert to a data.table, then keep only the rows where the scores differ
# (column names as given in the question).
Baseball2 <- data.table(Baseball)
Baseball2 <- Baseball2[team1_score != team2_score, ]

Related

How to find a total of row values in R

I am trying to count, for each row, how many columns have a value of 3 or 4. For example, the first row has only one value of 3, so if I create a new column
currentdx_count1$TotalDiagnoses
that new column, TotalDiagnoses, should have a value of 1 for the first row. I have tried
currentdx_count1$TotalDiagnoses <- rowSums(currentdx_count1[2:32])
As expected, this doesn't give me what I need, because it literally sums up the whole row. Is there an existing function that does what I want, or will I have to write one? Could I pass something more specific to rowSums to make it work?
Thanks for any and all help.
Edit: I'm trying to adapt a method I used earlier in my script that works for a similar purpose
findtotal <- endsWith(names(currentdx_count1), 'Current')
findtotal <- lapply(findtotal, `>`, 2)
findtotal <- unlist(findtotal)
currentdx_count1$TotalDiagnoses <- currentdx_count1[c(findtotal)]
I get an error which I have never seen before (an error in view?!)
So I tried just this
findtotal <- endsWith(names(currentdx_count1), 'Current')
currentdx_count1$TotalDiagnoses <- currentdx_count1[c(findtotal)]
This gets me closer, but it finds the total count for each column separately, which is not what I need. I want a single column that holds the count for each SID.

You can compare the data frame with the values 3 and 4 and then use rowSums to count the matches:
currentdx_count1$TotalDiagnoses <- rowSums(currentdx_count1[-1] == 3 |
currentdx_count1[-1] == 4)
currentdx_count1$TotalDiagnoses
#[1] 1 2 2 2 1 1 1 1 1 1 1 1 1 2
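A minimal, self-contained sketch of the same idea follows. The data frame dx, its column names, and its values are all hypothetical (the question's data is not shown); the only assumptions carried over are that the first column is the SID and the remaining columns hold the diagnosis codes.

```r
# Hypothetical data: SID plus three diagnosis columns coded 1 to 4.
dx <- data.frame(SID = c(101, 102, 103),
                 ADHD_Current = c(3, 1, 4),
                 Anx_Current  = c(1, 2, 4),
                 Dep_Current  = c(2, NA, 3))

# Logical matrix: TRUE where a diagnosis value is 3 or 4 (SID is excluded).
is_hit <- dx[-1] == 3 | dx[-1] == 4

# rowSums counts the TRUEs per row; na.rm = TRUE skips missing values.
dx$TotalDiagnoses <- rowSums(is_hit, na.rm = TRUE)
dx$TotalDiagnoses
```

The comparison turns each cell into TRUE or FALSE first, which is why rowSums ends up counting matches instead of adding the raw values.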

Excluding a number of answers from a R dataframe

I'm looking for a way to exclude a number of answers from a length function.
This is a follow-on question from "Getting R Frequency counts for all possible answers". In SQL the syntax could be
select * from someTable
where variableName not in ( 0, null )
Given
Id <- c(1,2,3,4,5)
ClassA <- c(1,NA,3,1,1)
ClassB <- c(2,1,1,3,3)
R <- c(5,5,7,NA,9)
S <- c(3,7,NA,9,5)
df <- data.frame(Id,ClassA,ClassB,R,S)
ZeroTenNAScale <- c(0:10,NA);
R.freq = setNames(nm=c('R','freq'),data.frame(table(factor(df$R,levels=ZeroTenNAScale,exclude=NULL))));
S.freq = setNames(nm=c('S','freq'),data.frame(table(factor(df$S,levels=ZeroTenNAScale,exclude=NULL))));
length(S.freq$freq[S.freq$freq!=0])
# 5
How would I change
length(S.freq$freq[S.freq$freq!=0])
to get an answer of 4 by excluding 0 and NA?

We can use colSums:
colSums(!is.na(S.freq)[S.freq$freq!=0,])[[1]]
#[1] 4

You can use sum to add up the frequencies. If the NAs were in the column being summed you could pass na.rm = TRUE, but here the NA is in a different column (S), so you first need to remove the row containing it.
The solution is as follows: we remove the rows containing NA by subsetting S.freq[!is.na(S.freq$S),], and then select the second column, freq. Note that this adds up the counts, so it matches the number of distinct answers only because each answer occurs once here:
sum(S.freq[!is.na(S.freq$S), "freq"])
# 4

You can try na.omit (to remove NAs) and subset (to get rid of all rows where freq equals 0):
subset(na.omit(S.freq), freq != 0)
   S freq
4  3    1
6  5    1
8  7    1
10 9    1
From here, that's straightforward:
length(subset(na.omit(S.freq), freq != 0)$freq)
[1] 4
Does it solve your problem?

Just add !is.na(S.freq$S) as a second filter:
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$S)])
If you want to extend it with other conditions, you could make an index vector first for readability:
idx <- S.freq$freq!=0 & !is.na(S.freq$S)
length(S.freq$freq[idx])

You're looking for values with frequency > 0, which means you're looking for unique values. You can get this information directly from the vector S:
length(unique(df$S))
and, leaving NA aside, you get the answer 4 with:
length(unique(df$S[!is.na(df$S)]))
Regarding your question on how to exclude a number of items based on their value:
In R this is easily done with logical vectors, as you already did in your code:
length(S.freq$freq[S.freq$freq!=0])
You can combine different conditions into one logical vector and use it for subsetting. The NA test has to be on S.freq$S, because the freq column itself never contains NA:
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$S)])
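In the spirit of the SQL "not in" from the question, a list of excluded values can also be applied with %in%. This is only a sketch working on the raw S vector from the question; the names excluded, answers, and kept are made up for illustration.

```r
# Data from the question.
S <- c(3, 7, NA, 9, 5)

# Values to exclude, loosely mirroring "not in (0, null)".
excluded <- c(0, NA)

# %in% matches NA against NA, so the missing value is filtered out as well.
answers <- unique(S)
kept <- answers[!(answers %in% excluded)]
length(kept)
```

Because %in% never returns NA, the negated index is safe to use for subsetting without an extra is.na() check.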

Drop columns in a data.frame with conditions R

I am trying to be lazier than ever with R and I was wondering whether it is possible to drop columns from a data.frame by using a condition.
For instance, let's say my data.frame has 50 columns.
I want to drop all the columns whose mean is 0, that is, every column for which
mean(mydata$coli)... = mean(mydata$coln) = 0
How would you write this code in order to drop them all at once? I usually drop columns with
mydata2 <- subset(mydata, select = c(vari, ..., varn))
That is obviously not attractive here, because it requires checking the data manually.
Thank you all!

Something similar to @akrun's answer, using lapply:
mydata <- data.frame(col1=0, col2=1:7, col3=0, col4=-3:3)
mydata[lapply(mydata, mean)!=0]
# col2
#1 1
#2 2
#3 3
#4 4
#5 5
#6 6
#7 7

We can use colMeans to get the mean of all the columns as a vector, convert that to a logical index (!=0) and subset the dataset.
mydata[colMeans(mydata)!=0]
Or use Filter with mean as the function f. Filter coerces each result to logical, so a mean of 0 becomes FALSE (the column is dropped) and any other mean becomes TRUE (the column is kept).
Filter(mean, mydata)
data
mydata <- data.frame(col1=0, col2=1:7, col3=0, col4=-3:3)
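If the data frame also contains non-numeric columns, colMeans and mean will not work on them directly. The following sketch is one possible way around that; the id column is a hypothetical addition to the example data, and zero_mean is an illustrative name.

```r
# Hypothetical data frame: the example data plus a character column.
mydata <- data.frame(id = letters[1:7], col1 = 0, col2 = 1:7,
                     col3 = 0, col4 = -3:3,
                     stringsAsFactors = FALSE)

# Flag numeric columns whose mean is 0; non-numeric columns are never flagged.
zero_mean <- vapply(mydata,
                    function(x) is.numeric(x) && mean(x, na.rm = TRUE) == 0,
                    logical(1))

# Keep everything that was not flagged.
mydata2 <- mydata[!zero_mean]
names(mydata2)
```

With real measurements an exact comparison against 0 can miss means that are zero only up to rounding error, so a small tolerance such as abs(mean(x)) < 1e-8 may be a safer test.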

create new dataframe based on 2 columns

I have a large dataset "totaldata" containing multiple rows relating to each animal. Some of them are LactationNo 1 readings, and others are LactationNo 2 readings. I want to extract all animals that have readings from both LactationNo 1 and LactationNo 2 and store them in another dataframe "lactboth"
There are 16 other columns of variables of varying types in each row that I need to preserve in the new dataframe.
I have tried merge, aggregate and %in%, but perhaps I'm using them incorrectly, e.g.
(lactboth <- totaldata[totaldata$LactationNo %in% c(1,2), ])
AnimalId is column 1 and LactationNo is column 2. I can't figure out how to select only those AnimalId values that have both LactationNo 1 and LactationNo 2.
I have also tried
lactboth <- totaldata[ which(totaldata$LactationNo==1 & totaldata$LactationNo ==2), ]
I feel like this should be simple, but couldn't find an example to follow quite the same. Help appreciated!!

If I understand your question correctly, then your dataset looks something like this:
  AnimalId LactationNo
1        A           1
2        B           2
3        E           2
4        A           2
5        E           2
and you'd like to select animals that happen to have both lactation numbers 1 & 2 (like A in this particular example). If that's the case, then you can simply use merge:
lactboth <- merge(totaldata[totaldata$LactationNo == 1,],
totaldata[totaldata$LactationNo == 2,],
by.x="AnimalId",
by.y="AnimalId")[,"AnimalId"]
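As written, the merge call above returns only the matching AnimalId values. To keep every row and every column for those animals, one possible sketch uses intersect and %in%. The Yield column below is hypothetical and stands in for the 16 other variables mentioned in the question.

```r
# Hypothetical data in the shape described above, plus one extra column.
totaldata <- data.frame(AnimalId    = c("A", "B", "E", "A", "E"),
                        LactationNo = c(1, 2, 2, 2, 2),
                        Yield       = c(20, 25, 22, 27, 24),
                        stringsAsFactors = FALSE)

# Animals with at least one reading in each lactation.
ids1 <- totaldata$AnimalId[totaldata$LactationNo == 1]
ids2 <- totaldata$AnimalId[totaldata$LactationNo == 2]
both <- intersect(ids1, ids2)

# Keep every row, with all columns, for those animals.
lactboth <- totaldata[totaldata$AnimalId %in% both, ]
lactboth
```

Here only animal A has readings in both lactations, so lactboth contains its two rows with all three columns preserved.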

R: Subset: Using whole dataframe except one column

I'd like to exclude a single column from an operation on a data frame. Of course I could just copy the data frame without the column that I want to exclude, but this seems like a workaround. I think there must be an easier way to subset.
This example code shows what I am trying to do.
df<-data.frame(a=c(1:5),b=c(6:10),c=c(11:15))
# First subset: operate on a single column
mean(df[,1])
[1] 3
# Second subset: with a set of chosen columns
colMeans(df[,c(1,3)])
a c
3 13
# Third subset: exclude column b from the operation (expected output should match the second subset)
colMeans(df[,!=2])
Error: unexpected '!=' in "colMeans(df[,!="
Any help would be very appreciated.

> colMeans(df[,-2])
a c
3 13

An alternative would be the %in% operator (which is handy if you want to use a few different named columns):
colMeans( df[ , ! colnames(df) %in% c("b") ])
#a c
#3 13

Try
colMeans(df[, -2])
## a c
## 3 13
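Negative indices depend on column positions, which can shift when the data frame changes. As a sketch of a name-based alternative, setdiff can build the list of columns to keep; drop_cols and keep_cols are illustrative names, not part of the original answers.

```r
df <- data.frame(a = 1:5, b = 6:10, c = 11:15)

# Names of the columns to leave out of the calculation.
drop_cols <- c("b")

# setdiff() returns the column names that are not in drop_cols.
keep_cols <- setdiff(names(df), drop_cols)

# drop = FALSE keeps a data frame even if only one column remains.
res <- colMeans(df[, keep_cols, drop = FALSE])
res
```

The drop = FALSE argument matters when only one column is left, because colMeans needs a two-dimensional input.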
