how to remove part of a string without interrupting a data frame? - r

I have a data looks like this but way much bigger
df<- structure(list(names = c("bests-1", "trible-1", "crazy-1", "cool-1",
"nonsense-1", "Mean-1", "Lose-1", "Trye-1", "Trified-1"), Col = c(1L,
2L, NA, 4L, 47L, 294L, 2L, 1L, 3L), col2 = c(2L, 4L, 5L, 7L,
9L, 9L, 0L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-9L))
as an example, I am trying to remove -1 from all strings of the first column
I can do this with
as.data.frame(str_remove_all(df$names, "-1"))
the problem is that it will remove all other columns as well.
I dont want to split the data and merge again because I am afraid I Make a mismatch
Is there anyway without interrupting, just getting raid of specific strings?
for instance the output should looks like this
names Col col2
bests 1 2
trible 2 4
crazy NA 5
cool 4 7
nonsense 47 9
Mean 294 9
Lose 2 0
Try 1 2
Trified 3 3

Using gsub, escape the special \\-, and $ for end of string.
transform(df, names=gsub('\\-1$', '', names))
# names Col col2
# 1 bests 1 2
# 2 trible 2 4
# 3 crazy NA 5
# 4 cool 4 7
# 5 nonsense 47 9
# 6 Mean 294 9
# 7 Lose 2 0
# 8 Trye 1 2
# 9 Trified 3 3
Data:
df <- structure(list(names = c("bests-1", "trible-1", "crazy-1", "cool-1",
"nonsense-1", "Mean-1", "Lose-1", "Trye-1", "Trified-1"), Col = c(1L,
2L, NA, 4L, 47L, 294L, 2L, 1L, 3L), col2 = c(2L, 4L, 5L, 7L,
9L, 9L, 0L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-9L))

Using stringr package,
df$names = str_remove_all(df$names, '-1')
names Col col2
1 bests 1 2
2 trible 2 4
3 crazy NA 5
4 cool 4 7
5 nonsense 47 9
6 Mean 294 9
7 Lose 2 0
8 Trye 1 2
9 Trified 3 3

We could use trimws from base R
df$names <- trimws(df$names, whitespace = "-\\d+")
-output
> df
names Col col2
1 bests 1 2
2 trible 2 4
3 crazy NA 5
4 cool 4 7
5 nonsense 47 9
6 Mean 294 9
7 Lose 2 0
8 Trye 1 2
9 Trified 3 3

Related

How to replace NA's with specific data? [duplicate]

This question already has answers here:
How do I replace NA values with zeros in an R dataframe?
(29 answers)
Closed 11 months ago.
I have a df1:
ID Number
3 9
4 10
2 3
1 NA
8 5
6 4
9 NA
I would like to change the NAs to 1's, like below.
ID Number
3 9
4 10
2 3
1 1
8 5
6 4
9 1
With base R
df$Number[is.na(df$Number)] <- 1
Or with dplyr
library(dplyr)
df %>%
mutate(Number = ifelse(is.na(Number), 1, Number))
Or with data.table:
library(data.table)
setnafill(df, cols="Number", fill=1)
Output
ID Number
1 3 9
2 4 10
3 2 3
4 1 1
5 8 5
6 6 4
7 9 1
Data
df <- structure(list(ID = c(3L, 4L, 2L, 1L, 8L, 6L, 9L), Number = c(9L,
10L, 3L, NA, 5L, 4L, NA)), class = "data.frame", row.names = c(NA,
-7L))

R: Sorting columns based on partial match of column names with row names

I have a data frame that can be simplified to look like this (included the dput at the end):
T2_KL_21 A1_LC_11 W3_FA_22 RR_BI_12 PL_EW_12 RT_LC_22 YU_BI_21
FA 1 2 3 4 5 6 7
BI 1 2 3 4 5 6 7
KL 1 2 3 4 5 6 7
EW 1 2 3 4 5 6 7
LC 1 2 3 4 5 6 7
I would like to sort the columns so that they follow the order of the row names (based on partial match). It would then look like this:
W3_FA_22 RR_BI_12 YU_BI_21 T2_KL_21 PL_EW_12 A1_LC_11 RT_LC_22
FA 3 4 7 1 5 2 6
BI 3 4 7 1 5 2 6
KL 3 4 7 1 5 2 6
EW 3 4 7 1 5 2 6
LC 3 4 7 1 5 2 6
If more than one column name contains the string in the row names, they should be kept side by side, but the order does not matter.
I have already filtered the columns so that they all contain a match in the row names.
Here is the dput of the data frame:
structure(list(T2_KL_21 = c(1L, 1L, 1L, 1L, 1L), A1_LC_11 = c(2L,
2L, 2L, 2L, 2L), W3_FA_22 = c(3L, 3L, 3L, 3L, 3L), RR_BI_12 = c(4L,
4L, 4L, 4L, 4L), PL_EW_12 = c(5L, 5L, 5L, 5L, 5L), RT_LC_22 = c(6L,
6L, 6L, 6L, 6L), YU_BI_21 = c(7L, 7L, 7L, 7L, 7L)), .Names = c("T2_KL_21",
"A1_LC_11", "W3_FA_22", "RR_BI_12", "PL_EW_12", "RT_LC_22", "YU_BI_21"
), class = "data.frame", row.names = c("FA", "BI", "KL", "EW",
"LC"))
I have tried using pmatch, grep and match, with no success.
Any advice will be much appreciated! Thanks
We can loop through the rownames and grep to find the index of the column names that match, unlist and use that to arrange the columns
df1[unlist(lapply(gsub("\\d+", "", row.names(df1)), function(x) grep(x, names(df1))))]
#W3_FA_22 RR_BI_12 YU_BI_21 T2_KL_21 PL_EW_12 A1_LC_11 RT_LC_22
#FA 3 4 7 1 5 2 6
#BI 3 4 7 1 5 2 6
#KL 3 4 7 1 5 2 6
#EW 3 4 7 1 5 2 6
#LC 3 4 7 1 5 2 6

Identify rows with complete data in R by adding details in additional column [duplicate]

This question already has answers here:
Remove rows with all or some NAs (missing values) in data.frame
(18 answers)
Closed 7 years ago.
For a sample dataframe:
df1 <- structure(list(id = structure(1:5, .Label = c("a", "b", "c",
"d", "e"), class = "factor"), cat = c(5L, 7L, 6L, 2L, 8L), dog = c(7L,
NA, 6L, 13L, 2L), sheep = c(NA, 6L, 3L, 6L, 2L), cow = c(2L,
10L, 8L, 9L, 1L), rabbit = c(5L, 3L, NA, 2L, 4L), pig = c(7L,
NA, 12L, 5L, NA)), .Names = c("id", "cat", "dog", "sheep", "cow",
"rabbit", "pig"), class = "data.frame", row.names = c(NA, -5L
))
I want to add an extra column 'complete.farm' to identify which rows have values in columns 'sheep' AND 'cow' AND 'pig'. Any rows with NAs in one or
more of these columns should get a 0 and rows with real values should get a 1.
If anyone could give me some advice on this, I would really appreciate it. I usually use complete cases to subset my dataframe, but this time, I only want to add this information in a column.
This seems to work:
> df1$complete.farm <- ifelse( !is.na(df1$pig) & !is.na(df1$sheep) & !is.na(df1$cow), 1,0)
> df1
id cat dog sheep cow rabbit pig complete.farm
1 a 5 7 NA 2 5 7 0
2 b 7 NA 6 10 3 NA 0
3 c 6 6 3 8 NA 12 1
4 d 2 13 6 9 2 5 1
5 e 8 2 2 1 4 NA 0
ifelse is vectorised so you just mention the condition on the first argument with 1 as the confirmed and 0 the non-confirmed.
Another (simpler) way as per #thelatemail 's comment below:
df1$col <- as.numeric(complete.cases(df1[c("sheep","cow","pig")]))
> df1
id cat dog sheep cow rabbit pig complete.farm col
1 a 5 7 NA 2 5 7 0 0
2 b 7 NA 6 10 3 NA 0 0
3 c 6 6 3 8 NA 12 1 1
4 d 2 13 6 9 2 5 1 1
5 e 8 2 2 1 4 NA 0 0

R Aggregate and count of not null

I have the following data table
PIECE SAMPLE QC_CODE
1 1 1
2 1 NA
3 2 2
4 2 4
5 2 NA
6 3 6
7 3 3
8 3 NA
9 4 6
10 4 NA
and I would like to count the number of qc_code in each sample and return an output like this
SAMPLE SAMPLE_SIZE QC_CODE_COUNT
1 2 1
2 3 2
3 3 2
4 2 1
Where sample size is the count of pieces in each sample, and qc_code_count is the count of al qc_code that are no NA.
How would I go about this in R
You can try
library(dplyr)
df1 %>%
group_by(SAMPLE) %>%
summarise(SAMPLE_SIZE=n(), QC_CODE_UNIT= sum(!is.na(QC_CODE)))
# SAMPLE SAMPLE_SIZE QC_CODE_UNIT
#1 1 2 1
#2 2 3 2
#3 3 3 2
#4 4 2 1
Or
library(data.table)
setDT(df1)[,list(SAMPLE_SIZE=.N, QC_CODE_UNIT=sum(!is.na(QC_CODE))), by=SAMPLE]
Or using aggregate from base R
do.call(data.frame,aggregate(QC_CODE~SAMPLE, df1, na.action=NULL,
FUN=function(x) c(SAMPLE_SIZE=length(x), QC_CODE_UNIT= sum(!is.na(x)))))
data
df1 <- structure(list(PIECE = 1:10, SAMPLE = c(1L, 1L, 2L, 2L, 2L, 3L,
3L, 3L, 4L, 4L), QC_CODE = c(1L, NA, 2L, 4L, NA, 6L, 3L, NA,
6L, NA)), .Names = c("PIECE", "SAMPLE", "QC_CODE"), class = "data.frame",
row.names = c(NA, -10L))

Remove duplicated 2 columns permutations

I can't find a good title for this question so feel free to edit it please.
I have this data.frame
section time to from
1 a 9 1 2
2 a 9 2 1
3 a 12 2 3
4 a 12 2 4
5 a 12 3 2
6 a 12 3 4
7 a 12 4 2
8 a 12 4 3
I want to remove duplicated rows that have the same to and from simultaneously, without computing permutations of the 2 columns: e.g (1,2) and (2,1) are duplicated.
So final output would be:
section time to from
1 a 9 1 2
3 a 12 2 3
4 a 12 2 4
6 a 12 3 4
I have a solution by constructing a new column key e.g
key <- paste(min(to,from),max(to,from))
and remove duplicated key using duplicated, but I think this is dirty solution.
here the dput of my data
structure(list(section = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "a", class = "factor"), time = c(9L, 9L, 12L,
12L, 12L, 12L, 12L, 12L), to = c(1L, 2L, 2L, 2L, 3L, 3L, 4L,
4L), from = c(2L, 1L, 3L, 4L, 2L, 4L, 2L, 3L)), .Names = c("section",
"time", "to", "from"), row.names = c(NA, -8L), class = "data.frame")
mn <- pmin(s$to, s$from)
mx <- pmax(s$to, s$from)
int <- as.numeric(interaction(mn, mx))
s[match(unique(int), int),]
section time to from
1 a 9 1 2
3 a 12 2 3
4 a 12 2 4
6 a 12 3 4
Credit for the idea goes to this question: Remove consecutive duplicates from dataframe and specifically #MatthewPlourde's answer.
You can try using sort within the apply function to order the combinations.
mydf[!duplicated(t(apply(mydf[3:4], 1, sort))), ]
# section time to from
# 1 a 9 1 2
# 3 a 12 2 3
# 4 a 12 2 4
# 6 a 12 3 4

Resources