How to get rid of a varying number of leading zeros in a column in R?

I have a df1:
Story Score
1 00678
2 0980
3 1120
4 00067
5 0091
6 123
7 234
8 0234
9 00412
and I would like to strip all leading zeros to get a df2:
Story Score
1 678
2 980
3 1120
4 67
5 91
6 123
7 234
8 234
9 412

Assuming the Score column is stored as text, you could use sub() here:
df$Score <- sub("^0+", "", df$Score)
If you intend for Score to be treated and used as a number, you might also be able to simply cast it to numeric:
df$Score <- as.numeric(df$Score)
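For completeness, here is a minimal self-contained sketch of both approaches; the data frame name df1 and the assumption that Score was read in as character (e.g. via read.csv(..., colClasses = "character")) are taken from the example above, not from the original post:
df1 <- data.frame(Story = 1:9,
                  Score = c("00678", "0980", "1120", "00067", "0091",
                            "123", "234", "0234", "00412"),
                  stringsAsFactors = FALSE)
# Option 1: keep Score as text and strip leading zeros with a regex
df2 <- df1
df2$Score <- sub("^0+", "", df2$Score)
# Option 2: cast to numeric, which drops leading zeros implicitly
df2$Score <- as.numeric(df1$Score)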

Related

How to create a data frame with all ordinal variables as columns and with frequencies of a specific event

I have an ordinal data frame containing answers in survey format. I want to turn each factor level into a column so that I can get frequencies with respect to a specific variable.
I have tried lapply and dplyr to get the frequencies, but without success:
as.data.frame(apply(mtfinal, 2, table))
and
mtfinalf<-mtfinal %>%
group_by(q28) %>%
summarise(freq=n())
Expected results, in the form of a data.frame: a frequency table with respect to q28's factors.
q28 sex1 sex2 race1 race2 race3 race4 race5 race6 race7 age1 age2
2 0
3 0
4 23
5 21
Actual Results
$age
1 2 3 4 5 6 7
6 2 184 520 507 393 170
$sex
1 2
1239 543
$grade
1 2 3 4
561 519 425 277
$race7
1 2 3 4 5 6
179 21 27 140 17 1307
7
91
$q8
1 2 3 4 5
127 259 356 501 539
$q9
1 2 3 4 5
993 224 279 86 200
$q28
2 3 4 5
1034 533 94 121
The code below will give you a count of the number of unique combinations. Exactly what you are asking for is not possible, since there would be overlaps between the levels of sex, race and age.
mtfinalf<-mtfinal %>%
group_by(q28,age,race,sex) %>%
tally()
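If a wide layout close to the expected output is still wanted, one rough sketch (assuming mtfinal has the columns q28, sex, race7 and age shown in the question) is to cross-tabulate q28 against each variable separately and bind the tables together:
t_sex  <- table(mtfinal$q28, mtfinal$sex)
t_race <- table(mtfinal$q28, mtfinal$race7)
t_age  <- table(mtfinal$q28, mtfinal$age)
wide <- cbind(t_sex, t_race, t_age)
colnames(wide) <- c(paste0("sex",  colnames(t_sex)),
                    paste0("race", colnames(t_race)),
                    paste0("age",  colnames(t_age)))
as.data.frame.matrix(wide)
Each row is then a level of q28 and each column a level of one of the other variables, with the cell holding the count for that combination.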

How to find a repeated sequence of numbers in a data frame

Suppose I have the following data frame, and what I want to do is identify and remove certain observations.
The idea is to delete those observations containing 4 or more repetitions of the same digit.
df<-data.frame(col1=c(12,34,233,3333,3333333,333333,555555,543,456,87,4,111111,1111111111,22,222,2222,22222,9111111,912,8688888888))
col1
1 12
2 34
3 233
4 3333
5 3333333
6 333333
7 555555
8 543
9 456
10 87
11 4
12 111111
13 1111111111
14 22
15 222
16 2222
17 22222
18 9111111
19 912
20 8688888888
So the final output should be:
col1
1 12
2 34
3 233
4 543
5 456
6 87
7 4
8 22
9 222
10 912
Another way of removing the desired values would be to directly filter 1111, 2222 etc., using grep() after converting the numbers to characters.
df$col1[-as.numeric(grep(paste(1111*(1:9), collapse="|"), as.character(df$col1), value=F))]
# [1] 12 34 233 543 456 87 4 22 222 912
Not the most efficient method, but it seems to return the desired result: convert the vector to character, split each value into individual digits, use rle() to find runs of repeated digits, take the maximum run length, and keep the row (return TRUE) only if that maximum is less than 4.
df[sapply(strsplit(as.character(df$col1), ""),
function(x) max(rle(x)$lengths) < 4), , drop=FALSE]
col1
1 12
2 34
3 233
8 543
9 456
10 87
11 4
14 22
15 222
19 912
Note that this method will include values like 155155 (the same digit occurring four times, but not consecutively) while excluding values like 555511 or 155551.
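A compact alternative (not taken from the answers above, just a sketch of the same consecutive-run rule) is a backreference regex, where ([0-9])\\1{3} matches any digit followed by three more copies of itself:
df[!grepl("([0-9])\\1{3}", as.character(df$col1), perl = TRUE), , drop = FALSE]
This keeps exactly the rows returned by the rle() approach, and shares the same caveat about non-consecutive repeats.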

Omit rows in all data.frames in a list that don't share a common ID

I have a list, containing several data frames (only 2 in this example) of different sizes.
> myList
$`1`
ID values
1 1 100
2 2 200
3 3 240
4 4 403
5 5 212
6 6 432
7 7 423
8 8 123
9 9 543
10 10 982
$`2`
ID values
1 3 432
2 5 333
3 6 981
Now I need to omit, in each data frame, all rows whose ID is not shared by all of the other data frames. In this example, the result I'm looking for is:
> myList2
$`1`
ID values
3 3 240
5 5 212
6 6 432
$`2`
ID values
1 3 432
2 5 333
3 6 981
I've tried to use dplyr::setequal(), but I end up with FALSE: Different number of rows. I'd prefer a base R solution if possible. Thanks in advance!
Reproducible code:
myList <- list(data.frame('ID' = c(1:10), 'values' = c(100,200,240,403,212,432,423,123,543,982)),data.frame('ID' = c(3,5,6), 'values' = c(432,333,981)))
One way via base R is to use Reduce(intersect, ...) to find the common IDs from all data frames in the list. We then use that to index the data frames.
ind <- Reduce(intersect, lapply(myList, '[[', 1))
lapply(myList, function(i) i[i$ID %in% ind,])
#[[1]]
# ID values
#3 3 240
#5 5 212
#6 6 432
#[[2]]
# ID values
#1 3 432
#2 5 333
#3 6 981
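If this needs to be done repeatedly, the same idea can be wrapped in a small helper; filter_common is a hypothetical name, and the id argument assumes every data frame carries its identifier in a column called ID:
filter_common <- function(lst, id = "ID") {
  # IDs present in every data frame of the list
  common <- Reduce(intersect, lapply(lst, function(d) d[[id]]))
  # keep only the rows whose ID is in that common set
  lapply(lst, function(d) d[d[[id]] %in% common, , drop = FALSE])
}
myList2 <- filter_common(myList)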

Subset data frame to only include the nth highest value of a column

I have a data frame abc and would like to subset it to only include the row with the nth highest value of a certain variable "z". I know a simple solution here would be:
library(plyr)
abc <- arrange(abc, z)
abc <- abc[n,]
But is there a way to do it without first ordering the data frame? I'm only asking because ordering seems to be expensive on larger data frames.
Here's an example df to work with:
x y z
1 2 1 111
2 3 2 112
3 4 3 113
4 5 4 114
5 6 5 115
6 7 6 116
7 8 7 117
8 9 8 118
9 10 9 119
10 11 10 120
You may try
library(dplyr)
n <- 7
slice(abc, rank(z)[n])
Or, as @nicola commented, a base R option would be
abc[rank(abc$z)==n,]
Update
If you instead want the nth highest value (i.e. ranking in decreasing order):
slice(abc, rank(-z)[n])
# x y z
#1 5 4 114
abc[nrow(abc)-rank(abc$z)+1==n,]
# x y z
#4 5 4 114
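Since the motivation was to avoid the cost of fully ordering a large data frame, one further base R sketch (an alternative not given in the answers above) uses sort() with the partial argument, which only guarantees that the requested position ends up in sorted order; note that ties on z would return more than one row:
n <- 7
k <- nrow(abc) - n + 1                  # position of the nth highest value in ascending order
nth_val <- sort(abc$z, partial = k)[k]  # partial sort: only position k is guaranteed sorted
abc[abc$z == nth_val, , drop = FALSE]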

In R, how can one trim or winsorize data by a factor?

I'm trying to apply the winsor function at each level of a factor (subjects) in order to remove extreme cases. I can apply the winsor function to the entire column, but would like to do it within subject.
Subject RT
1 402
1 422
1 155
1 460
2 283
2 224
2 346
2 447
3 415
3 161
3 1
3 343
Ideally, I'd like the output to be a vector containing the same number of rows as the input, but with outliers (e.g. the second-to-last value of Subject 3) removed and replaced as per the winsor function.
You are looking for the by function (see ?by). Note that winsor() comes from the psych package.
# for example:
by(myDF$RT, myDF$Subject, winsor)
However, using data.table (instead of data.frame) might be better suited for you
### broken down step by step:
library(data.table)
myDT <- data.table(myDF)
myDT[, winsorResult := winsor(RT), by=Subject]
library(psych)
transform(dat, win = ave(RT, Subject, FUN = winsor))
Subject RT win
1 1 402 402.0
2 1 422 422.0
3 1 155 303.2
4 1 460 437.2
5 2 283 283.0
6 2 224 259.4
7 2 346 346.0
8 2 447 386.4
9 3 415 371.8
10 3 161 161.0
11 3 1 97.0
12 3 343 343.0
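Putting the second answer together as a self-contained sketch (the data frame name dat and the default trim of psych::winsor are assumptions, not stated in the question):
library(psych)
dat <- data.frame(Subject = rep(1:3, each = 4),
                  RT = c(402, 422, 155, 460,
                         283, 224, 346, 447,
                         415, 161, 1, 343))
# winsorize RT within each Subject; ave() preserves the original row order
dat$win <- ave(dat$RT, dat$Subject, FUN = winsor)
dat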
