How to remove special character from data frame - r

I have imported data from a url and converted it to a data frame using the following code:
url <-"http://apims.doe.gov.my/v2/hourly2.php"
tables<- readHTMLTable(url)
try<-do.call(rbind, lapply(tables, data.frame, stringsAsFactors=FALSE))
The data has '*' next to the numbers. I would like to isolate the numbers only.
So instead of
52* 45* 67* 55*
I would like to have
52 45 67 55
I have tried several methods to remove the * character from the 3rd through 8th columns and convert those columns to numeric, but because * also has a special meaning in R (in regular expressions), none of them work. I have tried:
x <- "~!##$%^&*"
str_replace_all(x, as.character(try[,3:8]), " ")
I have also tried:
gsub("*","",try[,3:8])
The only functions that have identified the * characters correctly are grep and grepl, but I need another function that will use the grep output to remove the '*' special character.
grep('*',try)

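In a regular expression * is a quantifier, so match it literally by putting it in a character class, [*], and then convert each column to numeric. Try this: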
dat<-do.call(rbind, lapply(tables, data.frame, stringsAsFactors=FALSE))
dat[, -(1:2)] <- sapply(dat[, -(1:2)], function(col) {
as.numeric(sub("[*]$", "", col))
})
head(dat)
# NEGERI...STATE KAWASAN.AREA MASA.TIME06.00AM MASA.TIME07.00AM MASA.TIME08.00AM MASA.TIME09.00AM MASA.TIME10.00AM MASA.TIME11.00AM
# NULL.1 Johor Kota Tinggi 52 53 52 50 50 49
# NULL.2 Johor Larkin Lama 51 51 51 NA 51 51
# NULL.3 Johor Muar 45 45 45 45 45 45
# NULL.4 Johor Pasir Gudang 56 56 55 56 56 56
# NULL.5 Kedah Alor Setar 50 50 50 50 50 49
# NULL.6 Kedah Bakar Arang, Sg. Petani NA NA NA NA NA NA
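For a version that can be run without the web page, here is a minimal sketch on a small made-up data frame. The column names (state, area, h06, h07) and the values are assumptions for illustration, not the scraped table. It uses fixed = TRUE so that * is treated as a literal character rather than a regex quantifier, and lapply in place of sapply so each column stays a separate vector.

```r
# Hypothetical stand-in for the scraped table: two label columns,
# then reading columns stored as text, some with a trailing "*".
dat <- data.frame(
  state = c("Johor", "Kedah"),
  area  = c("Muar", "Alor Setar"),
  h06   = c("52*", "45*"),
  h07   = c("67*", "55"),
  stringsAsFactors = FALSE
)

# Every column except the two label columns holds readings.
num_cols <- setdiff(names(dat), c("state", "area"))

# fixed = TRUE makes gsub match "*" literally; as.numeric then
# converts the cleaned strings to numbers.
dat[num_cols] <- lapply(dat[num_cols], function(col) {
  as.numeric(gsub("*", "", col, fixed = TRUE))
})

str(dat)
```

Escaping the character with "\\*" or using the character class "[*]" should behave the same way; fixed = TRUE just avoids regex interpretation altogether.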


Using a function and mapply in R to create new columns that sums other columns

Suppose, I have a dataframe, df, and I want to create a new column called "c" based on the addition of two existing columns, "a" and "b". I would simply run the following code:
df$c <- df$a + df$b
But I also want to do this for many other columns. So why won't my code below work?
# Reproducible data:
martial_arts <- data.frame(gym_branch=c("downtown_a", "downtown_b", "uptown", "island"),
day_boxing=c(5,30,25,10),day_muaythai=c(34,18,20,30),
day_bjj=c(0,0,0,0),day_judo=c(10,0,5,0),
evening_boxing=c(50,45,32,40), evening_muaythai=c(50,50,45,50),
evening_bjj=c(60,60,55,40), evening_judo=c(25,15,30,0))
# Creating a list of the new column names of the columns that need to be added to the martial_arts dataframe:
pattern<-c("_boxing","_muaythai","_bjj","_judo")
d<- expand.grid(paste0("martial_arts$total",pattern))
# Creating lists of the columns that will be added to each other:
e<- names(martial_arts %>% select(day_boxing:day_judo))
f<- names(martial_arts %>% select(evening_boxing:evening_judo))
# Writing a function and using mapply:
kick_him <- function(d,e,f){d <- rowSums(martial_arts[ , c(e, f)], na.rm=T)}
mapply(kick_him,d,e,f)
Now, mapply produces the correct results in terms of the addition:
> mapply(kick_him,d,e,f)
Var1 <NA> <NA> <NA>
[1,] 55 84 60 35
[2,] 75 68 60 15
[3,] 57 65 55 35
[4,] 50 80 40 0
But it doesn't add the new columns to the martial_arts dataframe. The function in theory should do the following
martial_arts$total_boxing <- martial_arts$day_boxing + martial_arts$evening_boxing
...
...
martial_arts$total_judo <- martial_arts$day_judo + martial_arts$evening_judo
and add four new total columns to martial_arts.
So what am I doing wrong?
The assignment is wrong here: the left-hand side should be the column name "total_boxing" alone, not the string "martial_arts$total_boxing", and the assignment belongs outside the function, on the left-hand side of the Map/mapply call. Since the OP already stored the 'martial_arts$' prefix in the Var1 column of 'd', we strip that prefix with sub and then assign:
kick_him <- function(e,f){rowSums(martial_arts[ , c(e, f)], na.rm=TRUE)}
martial_arts[sub(".*\\$", "", d$Var1)] <- Map(kick_him, e, f)
Check the dataset now:
> martial_arts
gym_branch day_boxing day_muaythai day_bjj day_judo evening_boxing evening_muaythai evening_bjj evening_judo total_boxing total_muaythai total_bjj total_judo
1 downtown_a 5 34 0 10 50 50 60 25 55 84 60 35
2 downtown_b 30 18 0 0 45 50 60 15 75 68 60 15
3 uptown 25 20 0 5 32 45 55 30 57 65 55 35
4 island 10 30 0 0 40 50 40 0 50 80 40 0
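The same idea can be sketched without the intermediate 'd' data frame by building the column names directly from the sport names. This is only an illustration under assumptions: the data frame is a cut-down copy with two gyms and two sports, and the helper vectors sports, day_cols and evening_cols are names introduced here, not part of the original question.

```r
# Cut-down, hypothetical version of the martial_arts data.
martial_arts <- data.frame(
  gym_branch     = c("downtown_a", "uptown"),
  day_boxing     = c(5, 25),
  day_judo       = c(10, 5),
  evening_boxing = c(50, 32),
  evening_judo   = c(25, 30)
)

# Build matching vectors of column names, one entry per sport.
sports       <- c("boxing", "judo")
day_cols     <- paste0("day_", sports)
evening_cols <- paste0("evening_", sports)

# Map returns one summed vector per sport; assigning the list to a
# character index creates the new total_* columns in one step.
martial_arts[paste0("total_", sports)] <- Map(
  function(d, e) martial_arts[[d]] + martial_arts[[e]],
  day_cols, evening_cols
)

martial_arts
```

The key point is that the new column names appear as plain strings inside the [ ] on the left-hand side, while the function only returns values.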

2 lines of headers in R from csv

I have a lot of csv files with double headers, as below. (This is only part of one file, and both header rows contain important information.) How could I combine the first two rows of the csv file to obtain a single header line (e.g. Life.expectancy.at.birth..years..1Female)?
Life.expectancy.at.birth..years..1 Life.expectancy.at.birth..years..2
1 Female Male
2 62 61
3 61 58
4 56 54
5 50 49
6 76 73
Read it twice and paste the headers together. For the second read, limit the number of rows read, since we really only need the header.
# in next 2 lines replace text=Lines with something like "myfile"
DF <- read.table(text = Lines, header = TRUE, skip = 1)
hdr1 <- read.table(text = Lines, header = TRUE, nrows = 1)
names(DF) <- paste0(names(hdr1), names(DF))
giving:
> DF
Life.expectancy.at.birth..years..1Female Life.expectancy.at.birth..years..2Male
1 62 61
2 61 58
3 56 54
4 50 49
5 76 73
Note: We used this for the input Lines:
Lines <- " Life.expectancy.at.birth..years..1 Life.expectancy.at.birth..years..2
Female Male
62 61
61 58
56 54
50 49
76 73"
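For an actual comma-separated file the same two-read approach might look like the sketch below. The file contents, the shortened header names and the temporary path are assumptions made up for the example; here the top header row is read as data (header = FALSE) so its text can be pasted onto the second header row.

```r
# Write a small, made-up two-header csv to a temporary file.
csv_text <- c(
  "Life.expectancy.1,Life.expectancy.2",
  "Female,Male",
  "62,61",
  "61,58"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)

# First read: only the top header row, kept as plain text.
top <- read.csv(path, header = FALSE, nrows = 1, stringsAsFactors = FALSE)

# Second read: skip the top row so the second row becomes the header.
DF <- read.csv(path, header = TRUE, skip = 1)

# Paste the two header rows together, column by column.
names(DF) <- paste0(unlist(top), names(DF))

DF
```

To apply this to many files, the three read-and-rename lines could be wrapped in a function and called with lapply over the file paths.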

R: Reformatting data file

I have what I suspect is a simple data reformatting question. The data file (txt) is structured with the observation numbers on separate lines,
1
45 65
78 56
2
89 34
39 55
The desired output is,
1 45 65
1 78 56
2 89 34
2 39 55
Suggestions on how to make that conversion would be most appreciated. Thanks.
We could read the file with readLines, create an index variable that marks the observation-number lines, and split the lines into groups on it. Then drop the first element (the observation number) from each list element, read the remaining lines with read.table, and unnest.
lines <- readLines('file.txt')
library(stringr)
#remove leading/trailing spaces if any
lines <- str_trim(lines)
#flag the observation-number lines, i.e. those without any white space
indx <- !grepl('\\s+', lines)
#cumsum the above index to create grouping
indx1 <- cumsum(indx)
#split the lines by group and name the list elements by observation number
lst <- setNames(split(lines, indx1), lines[indx])
#Use unnest after reading with read.table
library(tidyr)
unnest(lapply(lst, function(x) read.table(text=x[-1])), gr)
# gr V1 V2
#1 1 45 65
#2 1 78 56
#3 2 89 34
#4 2 39 55
Or we can use a base R approach with Map:
do.call(rbind,Map(cbind, gr=names(lst),
lapply(lst, function(x) read.table(text=x[-1]))))
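As a variation that needs no extra packages and no splitting into a list, the grouping index can be used to carry each observation number down to its data lines directly. This is a sketch under the assumption that the input looks exactly like the sample in the question; the lines are typed in here instead of being read from file.txt, and is_id, group, vals and out are names chosen for the example.

```r
# Sample input typed in directly; with a real file use readLines().
lines <- trimws(c("1", "45 65", "78 56", "2", "89 34", "39 55"))

# Observation-number lines are the ones without any white space.
is_id <- !grepl("\\s", lines)

# Carry each observation number forward to the lines that follow it.
group <- lines[is_id][cumsum(is_id)]

# Parse only the data lines, then attach the matching group labels.
vals <- read.table(text = lines[!is_id])
out  <- cbind(gr = as.integer(group[!is_id]), vals)

out
```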

Apply over all columns and rows of two different dataframes in R

I am trying to apply a function over all rows and columns of two dataframes, but I don't know how to solve it with apply.
I think the following script explains what I intend to do and the way I tried to solve it. Any advice would be warmly appreciated! Please note that simplefunction is only intended as an example function to keep things simple.
# some data and a function
df1<-data.frame(name=c("aa","bb","cc","dd","ee"),a=sample(1:50,5),b=sample(1:50,5),c=sample(1:50,5))
df2<-data.frame(name=c("aa","bb","cc","dd","ee"),a=sample(1:50,5),b=sample(1:50,5),c=sample(1:50,5))
simplefunction<-function(a,b){a+b}
# apply on a single row
simplefunction(df1[1,2],df2[1,2])
# apply over all columns
apply(?)
## apply over all columns and rows
# create df to receive results
df3<-df2
# loop it
for (i in 2:5)df3[i]<-apply(?)
My first mapply answer!! For your simple example you have...
mapply( FUN = `+` , df1[,-1] , df2[,-1] )
# a b c
# [1,] 60 35 75
# [2,] 57 39 92
# [3,] 72 71 48
# [4,] 31 19 85
# [5,] 47 66 58
You can extend it like so...
mapply( FUN = function(x,y,z,etc){ simplefunctioncodehere} , df1[,-1] , df2[,-1] , ... other dataframes here )
The data frames are passed to the function in order, so in this example df1 would be x, df2 would be y, and z and etc would be any further data frames you supply in that order. Hopefully that makes sense. mapply walks over the columns in parallel: it calls the function on the first column of every data frame, then on the second column of every data frame, and so on. Because + is vectorised, each call adds whole columns element by element, so the values end up combined row by row.
You can also use Reduce:
set.seed(45) # for reproducibility
Reduce(function(x,y) { x + y}, list(df1[, -1], df2[,-1]))
# a b c
# 1 53 22 23
# 2 64 28 91
# 3 19 56 51
# 4 38 41 53
# 5 28 42 30
You can just do:
df1[,-1] + df2[,-1]
Which gives:
a b c
1 52 24 37
2 65 63 62
3 31 90 89
4 90 35 33
5 51 33 45
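The outputs above differ from one another only because each answer drew its own random sample. With fixed, made-up values the approaches can be compared directly, as in this sketch; the small data frames are assumptions for illustration, and df3 is built the way the question describes, keeping the name column alongside the results.

```r
# Small fixed data frames so the result is reproducible.
df1 <- data.frame(name = c("aa", "bb", "cc"), a = 1:3, b = 4:6)
df2 <- data.frame(name = c("aa", "bb", "cc"),
                  a = c(10L, 20L, 30L), b = c(40L, 50L, 60L))

simplefunction <- function(a, b) a + b

# mapply pairs up the columns of the two data frames and returns
# a matrix with one column per pair.
via_mapply <- mapply(simplefunction, df1[, -1], df2[, -1])

# For a vectorised function, operating on the data frames directly
# gives the same numbers.
direct <- df1[, -1] + df2[, -1]

# Result data frame that keeps the name column.
df3 <- cbind(df1["name"], as.data.frame(via_mapply))

df3
```

mapply is mainly worth the extra typing when the function cannot simply be applied to two whole data frames at once.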

Applying function to multiple rows using values from multiple rows

I have created the following simple function in R:
fun <- function(a,b,c,d,e){b+(c-a)*((e-b)/(d-a))}
I want to apply this function to a data.frame that looks something like:
> data.frame("x1"=seq(55,75,5),"x2"=round(rnorm(5,50,10),0),"x3"=seq(30,10,-5))
x1 x2 x3
1 55 51 30
2 60 45 25
3 65 43 20
4 70 57 15
5 75 58 10
I want to apply fun to each separate row to create a new variable x4, but now comes the difficult part (to me at least..): for the arguments d and e I want to use the values x2 and x3 from the next row. So for the first row of the example that would mean: fun(a=55,b=51,c=30,d=45,e=25). I know that I can use mapply() to apply a function to each row, but I have no clue on how to tell mapply that it should use some values from the next row, or whether I should be looking for a different approach than mapply()?
Many thanks in advance!
Use mapply, but shift the vectors passed as the fourth and fifth arguments (d and e) by one row. You can do it manually, or use taRifx::shift.
> dat
x1 x2 x3
1 55 25 30
2 60 58 25
3 65 59 20
4 70 68 15
5 75 43 10
library(taRifx)
> shift(dat$x2)
[1] 58 59 68 43 25
> mapply( dat$x1, dat$x2, dat$x3, shift(dat$x2), shift(dat$x3) , FUN=fun )
[1] 25.00000 -1272.00000 719.00000 -50.14815 26.10000
If you want the last row to be NA rather than wrapping, use wrap=FALSE,pad=TRUE:
> shift(dat$x2,wrap=FALSE,pad=TRUE)
[1] 58 59 68 43 NA
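Doing the shift manually, without taRifx, could look like the following sketch. The helper lead1 is a hypothetical name introduced here; it drops the first element and pads the end with NA, so the last row gets NA instead of wrapping around. The data are the dat values shown above.

```r
fun <- function(a, b, c, d, e) { b + (c - a) * ((e - b) / (d - a)) }

dat <- data.frame(x1 = seq(55, 75, 5),
                  x2 = c(25, 58, 59, 68, 43),
                  x3 = seq(30, 10, -5))

# Hypothetical helper: move every value up one row, pad with NA.
lead1 <- function(x) c(x[-1], NA)

# d and e come from the next row's x2 and x3.
dat$x4 <- mapply(fun, dat$x1, dat$x2, dat$x3,
                 lead1(dat$x2), lead1(dat$x3))

dat
```

Since fun only uses vectorised arithmetic, calling fun(dat$x1, dat$x2, dat$x3, lead1(dat$x2), lead1(dat$x3)) directly should give the same column without mapply.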
