Replace strings in data frame columns with integer in R - r

I have a data frame called 'foo':
foo <- data.frame("row1" = c(1,2,3,4,5), "row2" = c(1,2.01,3,"-","-"))
'foo' was uploaded from a different program as a CSV file and has two columns. One is a numeric data type and the other is a factor data type.
str(foo)
'data.frame': 5 obs. of 2 variables:
$ row1: num 1 2 3 4 5
$ row2: Factor w/ 4 levels "-","1","2.01",..: 2 3 4 1 1
Notice there are dashes ("-") in foo$row2, which cause this column to be read as a factor. I want to replace the dashes with zeros, so that data.class(foo$row2) returns 'numeric'. The idea is to replace all dashes in each column so I can run numerical analyses on it in R.
What is the simplest way to do this in R?
Thanks,

Q: The idea is to replace all dashes in each column so I can run numerical analyses on it with R.
Use apply or sapply with sub
kk<-data.frame(apply(foo,2,function(x) as.numeric(sub("-",0,x))))
> kk
row1 row2
1 1 1.00
2 2 2.01
3 3 3.00
4 4 0.00
5 5 0.00
> str(kk$row2)
num [1:5] 1 2.01 3 0 0
Or, you can use sapply
kk<-data.frame(sapply(names(foo),function(x)as.numeric(sub("-",0,foo[,x]))))
Update:
If you want just the second column, you don't need to use apply:
foo$row2 <- as.numeric(sub("-", 0, foo[,2]))

Here is one simple way to do it. There might be a more elegant way, but this will work:
> foo <- data.frame("row1" = c(1,2,3,4,5), "row2" = c(1,2.01,3,"-","-"))
> levels(foo$row2)[levels(foo$row2)=="-"]<-0
> foo$row2<-as.numeric(as.character(foo$row2))
> class(foo$row2)
[1] "numeric"
> foo
row1 row2
1 1 1.00
2 2 2.01
3 3 3.00
4 4 0.00
5 5 0.00

I would use ifelse() for this:
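foo$row2 <- ifelse(foo$row2 == "-", 0, as.numeric(as.character(foo$row2)))
The as.character() call is needed when row2 is a factor: as.numeric() applied directly to a factor returns the integer level codes, not the original values. The "-" entries trigger an "NAs introduced by coercion" warning, but ifelse() replaces them with 0.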
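The caveat about factors can be checked with a small sketch. It assumes row2 is a factor, as in the str(foo) output of the question; since R 4.0.0, data.frame() keeps strings as character by default, so stringsAsFactors = TRUE is set explicitly here. The variable names codes, values and fixed are made up for illustration.

```r
# Rebuild foo with row2 as a factor, as in the question's str() output
foo <- data.frame(row1 = c(1, 2, 3, 4, 5),
                  row2 = c(1, 2.01, 3, "-", "-"),
                  stringsAsFactors = TRUE)

# as.numeric() on a factor gives the internal level codes, not the values
codes <- as.numeric(foo$row2)

# Going through as.character() first recovers the values ("-" becomes NA)
values <- suppressWarnings(as.numeric(as.character(foo$row2)))

# Replace the dashes with 0 and keep the converted values elsewhere
fixed <- ifelse(foo$row2 == "-", 0, values)
fixed
```

Comparing codes with fixed shows why the extra as.character() step matters.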

How about gsub...
as.numeric( gsub("-" , 0 , foo[,2] ) )
#[1] 1.00 2.01 3.00 0.00 0.00
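Since foo came from a CSV file, another option is to deal with the dashes at import time. This is only a sketch, assuming the file is read with read.csv and that "-" is the only non-numeric marker; the inline csv_text stands in for the real file, which is not shown in the question.

```r
# Stand-in for the real CSV file (hypothetical content matching foo)
csv_text <- "row1,row2\n1,1\n2,2.01\n3,3\n4,-\n5,-"

# Treat "-" as missing so row2 is parsed as numeric straight away
foo <- read.csv(text = csv_text, na.strings = "-")

# Replace the missing values with 0, as requested in the question
foo$row2[is.na(foo$row2)] <- 0
str(foo)
```

With a real file, the text = csv_text argument would be replaced by the file path.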

Related

Using S3 object to analyse other data.frames - noob level question

This was an attempt to understand OOP - taking a tutorial in it now. I'm very new at R and I would consider deleting the question as it is not well formulated. I would not use it for learning or guidance.
Part 1 - I want to build a class in S3 using the datadump1111 data. I want to call the S3 object a50survey and then output results from it. This seems to work, but I'm not sure I made a proper S3 class or that the function is working as it normally should.
a50DATA <- datadump1111
inputdata <- sapply(a50DATA, function(x) t(sapply(x, table)))
a50survey <- sapply(inputdata, function(x) colSums(prop.table(x)))
print(a50survey)
class(a50survey)
a50survey$drugs
> a50survey$drugs
Not Tried once Occasional Regular
0.72 0.12 0.14 0.02
Part 2 - What I'm really aiming for is to introduce new data, datadump2222, in place of the original datadump1111. I was trying to do that with:
a50DATA <- datadump2222
a50survey$drugs
What I get is
> a50survey$drugs
Not Tried once Occasional Regular
0.72 0.12 0.14 0.02
What I should get is
$drugs
Not Tried once Occasional Regular
0.52 0.14 0.34 0.00
...and when I run a50survey I was hoping that the new a50DATA would get picked up by the S3 object a50survey and give me another set of outputs, correct for the new dataset, i.e. datadump2222. But it returns the output computed from datadump1111.
Request for help: Can you guide me simply to the solution, as I want to understand and replicate this? Thank you.
The output from datadump1111 is here:
> dim(datadump1111)
[1] 50 4
> str(datadump1111)
'data.frame': 50 obs. of 4 variables:
$ alcohol: Factor w/ 5 levels "Not","Once or Twice a week",..: 3 2 3 2 4 4 3 4 2 4 ...
$ drugs : Factor w/ 4 levels "Not","Tried once",..: 1 3 1 1 3 1 2 3 1 1 ...
$ smoke : Factor w/ 3 levels "Not","Occasional",..: 1 3 1 1 1 3 3 3 1 2 ...
$ sport : Factor w/ 2 levels "Not regular",..: 1 1 1 1 2 2 2 2 1 2 ...
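One reading of the problem is that a50survey is computed once, when the sapply lines run, so reassigning a50DATA afterwards does not change it; the summary has to be recomputed from the new data. A minimal sketch under that assumption follows. The constructor new_a50survey(), the print method, and the toy data frames are hypothetical names and data introduced for illustration, and the summary is simplified to per-column proportions rather than the exact sapply pipeline above.

```r
# Hypothetical constructor: proportion of each answer, per column
new_a50survey <- function(data) {
  props <- lapply(data, function(col) prop.table(table(col)))
  structure(props, class = "a50survey")
}

# S3 print method, dispatched on the "a50survey" class
print.a50survey <- function(x, ...) {
  cat("<a50survey> with", length(x), "question(s)\n")
  print(lapply(unclass(x), round, digits = 2))
  invisible(x)
}

# Toy stand-ins for datadump1111 and datadump2222 (made-up values)
toy1 <- data.frame(drugs = c("Not", "Not", "Tried once", "Regular"))
toy2 <- data.frame(drugs = c("Not", "Regular", "Regular", "Regular"))

# Each data set gets its own object; nothing is updated behind the scenes
survey1 <- new_a50survey(toy1)
survey2 <- new_a50survey(toy2)
print(survey2)
```

Calling the constructor again for each new data set keeps the results tied to the data they came from.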

How to sort a vector in R without repeating ranks

Good afternoon,
My question may seem very elementary, but I'm having trouble with it.
Assume we have the following vector:
x=c(0.75,0.75,1,1,0.5,0.5,0.5,0.25,0.25)
I want to sort the vector in decreasing order and then get the indices, i.e.:
sort.int(x, index.return=TRUE,decreasing=TRUE)
$x
[1] 1.00 1.00 0.75 0.75 0.50 0.50 0.50 0.25 0.25
$ix
[1] 3 4 1 2 5 6 7 8 9
However, the expected output should be:
y=c(2,2,1,1,3,3,3,4,4)
This means:
1 is the highest value -----> 1
0.75 is the second highest value -----> 2
0.5 is the third -----> 3
0.25 is the lowest value -----> 4
I also tried:
x=c(0.75,0.75,1,1,0.5,0.5,0.5,0.25,0.25)
order(unique(sort(x)))
sort(unique(x),decreasing=TRUE)
[1] 1 2 3 4
[1] 1.00 0.75 0.50 0.25
But I don't know how to subset from x to get the expected output y.
Thank you for your help!
sort will sort all the values, and use each value once. It seems like you want to ignore the indices of duplicated values after the first. We can use match for this, which will always return the index of the first match.
match(sort.int(x, decreasing = TRUE), unique(x))
# [1] 2 2 1 1 3 3 3 4 4
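If what is wanted is the rank of each element of x among the distinct values (the value-to-rank mapping listed in the question), an alternative sketch is to match x itself against the sorted distinct values. This is a different technique from the answer above, offered under that reading of the question; the second vector x2 is a made-up example where the two approaches give different results.

```r
x <- c(0.75, 0.75, 1, 1, 0.5, 0.5, 0.5, 0.25, 0.25)

# Distinct values from highest to lowest: position in this vector = rank
ranked_values <- sort(unique(x), decreasing = TRUE)

# Rank of every element of x, in the original order of x
y <- match(x, ranked_values)
y

# Made-up second example: the two approaches no longer agree
x2 <- c(0.5, 1, 0.75)
dense_rank_x2 <- match(x2, sort(unique(x2), decreasing = TRUE))
answer_x2     <- match(sort.int(x2, decreasing = TRUE), unique(x2))
```

For the x in the question both approaches return the same vector, so the choice only matters for other inputs.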

Identify first match position in a string

I have a character string ("00010000") and need to identify the position of the first "1". (This tells me in which month a customer first became active.)
I have a dataset that looks like this:
id <- c(1:5)
seq <- c("00010000","00001000","01000000","10000000","00010000")
df <- data.frame(id,seq)
I would like to create a new field identifying the first_month_active for each id.
I can do this manually with a nested ifelse function:
df$first_month_active <-
ifelse(substr(df$seq,1,1)=="1",1,
ifelse(substr(df$seq,2,2)=="1",2,
ifelse(substr(df$seq,3,3)=="1",3,
ifelse(substr(df$seq,4,4)=="1",4,
ifelse(substr(df$seq,5,5)=="1",5,99 )))))
Which gives me the desired result:
id seq first_position
1 00010000 4
2 00001000 5
3 01000000 2
4 10000000 1
5 00010000 4
However, this is not an ideal solution for my data, which contains 36 months.
I would like to use a loop with an ifelse statement; however, I am really struggling with the syntax:
for (i in 1:36) {
ifelse(substr(df$seq,0+i,0+i)=="1",0+i,
}
Any ideas would be greatly appreciated
Or try the stringi package
library(stringi)
stri_locate_first_fixed(df$seq, "1")[, 1]
## [1] 4 5 2 1 4
Skip the loop and the ifelse:
9 - nchar(as.numeric(seq))
## [1] 4 5 2 1 4
This won't work the same in your data.frame because you coerced seq to factor implicitly, so just do:
9 - nchar(as.numeric(as.character(df$seq)))
## [1] 4 5 2 1 4
Edit: Just for fun, since Frank didn't convert his comment into an answer, here's a strsplit solution:
# from original vector
sapply(strsplit(seq, "1"), nchar)[1,] + 1
## [1] 4 5 2 1 4
# from data.frame
sapply(strsplit(as.character(df$seq), "1"), nchar)[1,] + 1
## [1] 4 5 2 1 4
You can use gregexpr.
> unlist(gregexpr(pattern = "1", seq, fixed = TRUE))
[1] 4 5 2 1 4
The following could do this job:
library(stringr)
str_locate(pattern ='1',seq)
Some comparisons:
library(stringi)
library(stringr)
seq <- c("00010010","00001000","10000010","10000000","00010000")
seq2 <- rep(seq, 5e6)
system.time(regexpr("1", seq2))
user system elapsed
4.78 0.03 4.82
system.time(9-nchar(as.numeric(as.character(seq2))))
user system elapsed
34.89 0.18 35.52
system.time(str_locate(pattern ='1',seq2))
user system elapsed
6.17 0.21 6.53
system.time(stri_locate_first_fixed(seq2, "1")[, 1])
user system elapsed
1.68 0.15 1.84
system.time(nchar(seq2)-round(log10(as.numeric(seq2))))
user system elapsed
7.67 0.09 7.86
system.time(nchar(sub('1.*', '', seq2))+1)
user system elapsed
14.61 0.11 14.93
Another one, using log:
nchar(seq)-round(log10(as.numeric(seq)))
Another option using sub
nchar(sub('1.*', '', seq))+1
#[1] 4 5 2 1 4
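The comparison above times regexpr, but none of the answers shows it on its own. A minimal base-R sketch: regexpr returns the position of the first match in each string and -1 where there is none. Converting -1 to NA is a choice made here, not something taken from the answers (the question's ifelse used 99 as the fallback), and the sixth string is an added test case. The name seq_chr is used to avoid masking base::seq.

```r
# Sample strings from the question, plus one with no "1" at all
seq_chr <- c("00010000", "00001000", "01000000", "10000000", "00010000",
             "00000000")

# Position of the first "1" in each string; fixed = TRUE skips regex parsing
first_pos <- as.integer(regexpr("1", seq_chr, fixed = TRUE))

# regexpr reports "no match" as -1; turn that into NA
first_pos[first_pos == -1L] <- NA_integer_
first_pos
```

Unlike the nested ifelse, this does not depend on the length of the strings, so it applies unchanged to 36 months.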

What is the difference between dataset[,'column'] and dataset$column in R?

If I want to list all rows of a column in a dataset in R, I am able to do it in these two ways:
> dataset[,'column']
> dataset$column
It appears that both give me the same result. What is the difference?
In practice, not much, as long as dataset is a data frame. The main difference is that the dataset[, "column"] formulation accepts variable arguments, like j <- "column"; dataset[, j] while dataset$j would instead return the column named j, which is not what you want.
dataset$column is list syntax and dataset[ , "column"] is matrix syntax. Data frames are really lists, where each list element is a column and every element has the same length. This is why length(dataset) returns the number of columns. Because they are "rectangular," we are able to treat them like matrices, and R kindly allows us to use matrix syntax on data frames.
Note that, for lists, list$item and list[["item"]] are almost synonymous. Again, the biggest difference is that the latter form evaluates its argument, whereas the former does not. This is true even in the form `$`(list, item), which is exactly equivalent to list$item. In Hadley Wickham's terminology, $ uses "non-standard evaluation."
Also, as mentioned in the comments, $ always uses partial name matching, [[ does not by default (but has the option to use partial matching), and [ does not allow it at all.
I recently answered a similar question with some additional details that might interest you.
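The points above about evaluation and partial matching can be tried on a small data frame. This is a sketch with a made-up dataset; the column names column and other are chosen only for illustration.

```r
dataset <- data.frame(column = 1:3, other = c("a", "b", "c"))
j <- "column"

by_bracket <- dataset[, j]   # j is evaluated, so this selects "column"
by_double  <- dataset[[j]]   # same: [[ evaluates its argument
by_dollar  <- dataset$j      # looks for a column literally named "j"

partial    <- dataset$col                     # $ matches "col" to "column"
exact_only <- dataset[["col"]]                # [[ is exact by default
opt_in     <- dataset[["col", exact = FALSE]] # partial matching on request
```

Here by_dollar and exact_only are NULL, while the other four hold the values of column.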
Use the str command to see the difference:
> mydf
user_id Gender Age
1 1 F 13
2 2 M 17
3 3 F 13
4 4 F 12
5 5 F 14
6 6 M 16
>
> str(mydf)
'data.frame': 6 obs. of 3 variables:
$ user_id: int 1 2 3 4 5 6
$ Gender : Factor w/ 2 levels "F","M": 1 2 1 1 1 2
$ Age : int 13 17 13 12 14 16
>
> str(mydf[1])
'data.frame': 6 obs. of 1 variable:
$ user_id: int 1 2 3 4 5 6
>
> str(mydf[,1])
int [1:6] 1 2 3 4 5 6
>
> str(mydf[,'user_id'])
int [1:6] 1 2 3 4 5 6
> str(mydf$user_id)
int [1:6] 1 2 3 4 5 6
>
> str(mydf[[1]])
int [1:6] 1 2 3 4 5 6
>
> str(mydf[['user_id']])
int [1:6] 1 2 3 4 5 6
mydf[1] is a data frame while mydf[,1] , mydf[,'user_id'], mydf$user_id, mydf[[1]], mydf[['user_id']] are vectors.
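The same distinction can be checked programmatically. A minimal sketch with a cut-down version of mydf; drop = FALSE is not mentioned above, it is the standard argument of `[` for keeping a one-column selection as a data frame.

```r
# Cut-down version of mydf (Gender column omitted for brevity)
mydf <- data.frame(user_id = 1:6, Age = c(13L, 17L, 13L, 12L, 14L, 16L))

as_vector     <- mydf[, "user_id"]                # dimension dropped
as_data_frame <- mydf[, "user_id", drop = FALSE]  # stays a data frame

is.data.frame(mydf[1])
is.data.frame(as_vector)
is.data.frame(as_data_frame)
```

This matters in functions that expect a data frame even when only one column is selected.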

tapply function complains that args are unequal length yet they appear to match

Here is the failing call, error messages and some displays to show the lengths in question:
it <- tapply(molten, c(molten$Activity, molten$Subject, molten$variable), mean)
# Error in tapply(molten, c(molten$Activity, molten$Subject, molten$variable), :
# arguments must have same length
length(molten$Activity)
# [1] 679734
length(molten$Subject)
# [1] 679734
length(molten$variable)
# [1] 679734
dim(molten)
# [1] 679734 4
str(molten)
# 'data.frame': 679734 obs. of 4 variables:
# $ Activity: Factor w/ 6 levels "WALKING","WALKING_UPSTAIRS",..: 5 5 5 5 5 5 5 5 5 5 ...
# $ Subject : Factor w/ 30 levels "1","2","3","4",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ variable: Factor w/ 66 levels "tBodyAcc-mean()-X",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ value : num 0.257 0.286 0.275 0.27 0.275 ...
If you have a look at ?tapply you will see that X should be "an atomic object, typically a vector". You feed tapply with a data frame ("molten"), which is not an atomic object. See is.atomic, and try is.atomic(molten). Furthermore, your grouping variables should be provided as a list (see INDEX argument).
Something like this works:
tapply(X = warpbreaks$breaks, INDEX = list(warpbreaks$wool, warpbreaks$tension), mean)
# L M H
# A 44.55556 24.00000 24.55556
# B 28.22222 28.77778 18.77778
You need a single object for INDEX, but c() concatenates the factors into one long vector, which is the source of the error, so use a list:
it <- tapply(molten$value, list(Act=molten$Activity, sub=molten$Subject, var=molten$variable), mean)
Better would be:
it <- with(molten , tapply(value, list(Act=Activity, Sub=Subject, var=variable), mean) )
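To see why c() triggers the length error, the lengths can be compared on a toy stand-in for molten. This is a sketch with made-up values and only two grouping factors; the real molten data is not reproduced here.

```r
# Toy stand-in for molten (hypothetical values)
molten <- data.frame(
  Activity = factor(c("WALKING", "WALKING", "SITTING", "SITTING")),
  Subject  = factor(c("1", "2", "1", "2")),
  value    = c(0.2, 0.4, 0.6, 0.8)
)

# c() glues the two factors end to end: twice as long as the data
combined <- c(molten$Activity, molten$Subject)
length(combined)
nrow(molten)

# A list keeps the factors separate, one grouping dimension each
it <- with(molten, tapply(value, list(Act = Activity, Sub = Subject), mean))
it
```

With three factors, as in the question, the concatenated vector is three times as long as value, which is what tapply reports as unequal lengths.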
Did you ever get this solved? I had the same issue when reading in a CSV file and could fix it by saving the original CSV file as CSV (delimiter separated) instead of CSV (delimiter separated, UTF-8). My dataset had German umlauts in it, though, so that might play a role as well.
