Convert comma separated decimals from character to numeric - r

For my exam I have to build some scatter plots in R. I created a data frame with 4 variables, and I want to add regression lines to my scatter plots.
The name of my data frame is "alle".
The variable names are: demo, tot, besch, usd
With this code I tried to fit the regression line, but got the following result:
reg1<- lm(tot~demo, data=alle)
Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
here is the structure of "alle"
str(alle)
'data.frame': 11 obs. of 4 variables:
$ demo : chr "498.300.775" "500.297.033" "502.090.235" "503.170.618" ...
$ tot : Factor w/ 11 levels "4.846.423","4.871.049",..: 1 3 4 5 2 8 7 6 10 9 ...
$ besch: Factor w/ 9 levels "68,4","68,6",..: 5 7 3 2 2 1 1 4 6 8 ...
$ usd : Factor w/ 44 levels "0,68434","0,72584",..: 26 30 29 23 28 22 24 25 15 14 ...
I tried to convert the column "demo" to numeric with
alle$demo <- as.numeric(as.character(alle$demo))
It converted the column to numeric, but now the rows are full of NAs.
I think that all columns must be numeric.
How can I convert all 4 columns to numeric and finally plot the regression lines?
Data:
> head(alle,6)
demo tot besch usd
1 498.300.775 4.846.423 69,8 1,3705
2 500.297.033 4.891.934 70,3 1,4708
3 502.090.235 4.901.358 69,0 1,3948
4 503.170.618 4.906.313 68,6 1,3257
5 502.964.837 4.871.049 68,6 1,3920
6 504.047.964 5.010.371 68,4 1,2848
thanks

Try doing it in two steps. First get rid of the dots, then replace the commas by decimal points and coerce to numeric.
alle[] <- lapply(alle, function(x) gsub("\\.", "", x))
alle[] <- lapply(alle, function(x) as.numeric(sub(",", ".", x)))
Note:
The above solution is split in two for readability. The following does the same in a single lapply loop and should therefore be faster if the dataset is big. If the dataset is small to medium, the two-step solution may be preferable.
alle[] <- lapply(alle, function(x){
as.numeric(sub(",", ".", gsub("\\.", "", x)))
})
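For completeness, here is a minimal sketch of the whole path from the strings in the question to a fitted regression line. It rebuilds the six rows shown by head(alle, 6) by hand, so the data frame here is an assumption and not the asker's full 11-row data, and to_num is just an illustrative helper name:

```r
# First six rows from the question, typed in as factors
# (demo was chr in the question; the helper below handles both)
alle <- data.frame(
  demo  = c("498.300.775", "500.297.033", "502.090.235",
            "503.170.618", "502.964.837", "504.047.964"),
  tot   = c("4.846.423", "4.891.934", "4.901.358",
            "4.906.313", "4.871.049", "5.010.371"),
  besch = c("69,8", "70,3", "69,0", "68,6", "68,6", "68,4"),
  usd   = c("1,3705", "1,4708", "1,3948", "1,3257", "1,3920", "1,2848"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: drop the thousands dots, then turn the
# decimal comma into a decimal point and coerce to numeric
to_num <- function(x) {
  x <- gsub(".", "", as.character(x), fixed = TRUE)
  as.numeric(sub(",", ".", x, fixed = TRUE))
}

alle[] <- lapply(alle, to_num)
str(alle)

# With numeric columns, lm() no longer warns and abline() can draw the fit
reg1 <- lm(tot ~ demo, data = alle)
plot(tot ~ demo, data = alle)
abline(reg1)
```

The original as.numeric(as.character(alle$demo)) returned NA because a string with two dots, such as "498.300.775", is not a valid number for R.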

With dplyr (this converts only besch and usd; demo and tot stay character, with their dots, as the output shows):
library(dplyr)
alle %>%
  mutate_all(as.character) %>%
  mutate_at(c("besch", "usd"), function(x) as.numeric(gsub(",", ".", x))) -> alle
demo tot besch usd
1 498.300.775 4.846.423 69.8 1.3705
2 500.297.033 4.891.934 70.3 1.4708
3 502.090.235 4.901.358 69.0 1.3948
4 503.170.618 4.906.313 68.6 1.3257
5 502.964.837 4.871.049 68.6 1.3920
6 504.047.964 5.010.371 68.4 1.2848
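If you want all four columns converted in one dplyr pipeline, something along these lines should work with dplyr 1.0.0 or later, where across() is available. This is only a sketch on three hand-typed rows from the question, and the name clean_num is made up for illustration:

```r
library(dplyr)

# Three rows copied from the question, kept as character
alle <- data.frame(
  demo  = c("498.300.775", "500.297.033", "502.090.235"),
  tot   = c("4.846.423", "4.891.934", "4.901.358"),
  besch = c("69,8", "70,3", "69,0"),
  usd   = c("1,3705", "1,4708", "1,3948"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove thousands dots, then swap the decimal comma
clean_num <- function(x) {
  x <- gsub(".", "", as.character(x), fixed = TRUE)
  as.numeric(sub(",", ".", x, fixed = TRUE))
}

# Apply the helper to every column
alle <- alle %>%
  mutate(across(everything(), clean_num))

str(alle)
```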

Related

Subset function in R works on one column but not on the other

I am trying to get some NBA data. However, I can't seem to select a subset of my data properly. I tried using the subset function to get only the players with more than 10 games, but it doesn't work for some reason. The subset works when I use a different column and I don't know why. Here's my code:
# install.packages(c("httr", "jsonlite", "dplyr"))  # run once if the packages are missing
library(httr)
library(jsonlite)
library(dplyr)
params = list(AheadBehind="Ahead or Behind", ClutchTime="Last 5 Minutes",
College="", Conference="", Country="",DateFrom="", DateTo="",
Division="", DraftPick="", DraftYear="", GameScope="",
GameSegment="", Height="", LastNGames= "0", LeagueID="00",
Location="", MeasureType="Base", Month="0", OpponentTeamID="0",
Outcome="", PORound="0", PaceAdjust="N", PerMode="PerGame",
Period="0", PlayerExperience="", PlayerPosition="",
PlusMinus= "N", PointDiff="5", Rank="N", Season="2020-21",
SeasonSegment="",SeasonType="Regular Season", ShotClockRange="",
StarterBench="", TeamID="0", VsConference="", VsDivision="",
Weight="" )
request_headers = c('Accept'='application/json, text/plain, */*',
'Accept-Encoding'='gzip, deflate, br',
'Accept-Language'="en-US,en;q=0.9",
'Connection'='keep-alive',
'Host'='stats.nba.com',
'Origin'='https://www.nba.com',
'Referer'='https://www.nba.com/',
'Sec-Fetch-Dest'='empty',
'Sec-Fetch-Mode'='cors',
'Sec-Fetch-Site'='same-site',
'User-Agent'='Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.192 Safari/537.36',
'x-nba-stats-origin'='stats',
'x-nba-stats-token'='true')
base <- 'https://stats.nba.com/stats/'
endpoint_players_clutch <- 'leaguedashplayerclutch'
call <- paste(base, endpoint_players_clutch, sep = '')
res <- httr::GET(call, httr::add_headers(.headers=request_headers), query=params)
json_resp <- jsonlite::fromJSON(content(res, "text"))
df <- data.frame(json_resp$resultSets$rowSet[1])
colnames(df) <- json_resp[["resultSets"]][["headers"]][[1]]
df2 <- select(df, PLAYER_NAME, GP, PLUS_MINUS)
df3 <- subset(df2, GP > 10)
# The line below works, but not the one above
# df3 <- subset(df2, PLUS_MINUS > 0)
Any solution would help, but it would help if the solution uses the subset function so that I know what I did wrong. Thanks
Let's show what should have been done:
df2 <- select(df, PLAYER_NAME, GP, PLUS_MINUS)
str(df2) # notice that GP displays like 'numeric' but is really 'factor'
#--------------
'data.frame': 382 obs. of 3 variables:
$ PLAYER_NAME: Factor w/ 382 levels "Aaron Gordon",..: 1 2 3 4 5 6 7 8 9 10 ...
$ GP : Factor w/ 23 levels "1","10","11",..: 22 22 12 12 6 1 21 1 2 12 ...
$ PLUS_MINUS : Factor w/ 89 levels "-0.1","-0.2",..: 53 63 33 29 3 39 14 86 78 10 ...
#-----------------------
df2$GP <- as.numeric(as.character(df2$GP)) #convert factor to numeric
df3 <- subset(df2, GP > 10)
str(df3)
#----------------------
'data.frame': 157 obs. of 3 variables:
$ PLAYER_NAME: Factor w/ 382 levels "Aaron Gordon",..: 5 13 14 16 17 21 23 29 31 32 ...
$ GP : num 14 18 16 11 14 18 18 13 17 12 ...
$ PLUS_MINUS : Factor w/ 89 levels "-0.1","-0.2",..: 3 5 48 76 33 11 66 43 10 3 ...
#----------
If this had been the result of a read.table or other read.* operation, then depending on your version of R the GP column might have been either a factor, as it was for me in my 3.6.2 session, or character, as it would be for anyone using an up-to-date version. The default for stringsAsFactors was changed to FALSE in R 4.0.0. When it is a factor, the GP column first needs to be converted to character before it can be converted to numeric. In your case it might be that jsonlite has not made the same decision about assuming columns that can be numeric should be numeric.
If you are running R 4.0 or above, the column should be character rather than factor, so you don't need the as.character inside the as.numeric.
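To see why the comparison goes wrong before conversion, here is a small sketch with a made-up vector of games played (gp_chr and gp_fct are hypothetical, not the NBA data). It covers both cases: a factor column, as in R 3.x, and a character column, as one would expect in R 4.0 and later:

```r
gp_chr <- c("14", "2", "10", "18", "9")
gp_fct <- factor(gp_chr)

# as.numeric() on a factor returns the internal level codes, not the labels
as.numeric(gp_fct)                  # 2 4 1 3 5

# Going through character (or through the levels) gives the real values
as.numeric(as.character(gp_fct))    # 14 2 10 18 9
as.numeric(levels(gp_fct))[gp_fct]  # 14 2 10 18 9

# On a character vector, `>` compares strings, so "2" and "9"
# count as greater than "10"
gp_chr > 10                         # TRUE TRUE FALSE TRUE TRUE

# After conversion the comparison is numeric
as.numeric(gp_chr) > 10             # TRUE FALSE FALSE TRUE FALSE
```

On a factor, GP > 10 is not meaningful at all: R warns and returns NA, and subset() drops the rows where the condition is NA.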
With the data you're working with, and since you have a lot of variables (66), you could use the type.convert helper function to convert the variables of the df data object to logical, integer, numeric, character, factor, etc. as appropriate. This would be a good initial step to make sure the variables in the initial data frame df are of the appropriate type. In R 4.0 and later, type.convert asks for the as.is argument to be given explicitly; as.is = TRUE keeps strings as character instead of converting them to factors.
For example:
df <- type.convert(data.frame(json_resp$resultSets$rowSet[1]), as.is = TRUE)
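A small sketch of what type.convert does, on a made-up three-row data frame of character columns. The player names and numbers are invented, and as.is = TRUE is my choice so that strings stay character rather than becoming factors:

```r
# Hypothetical stand-in for the data frame built from the JSON response:
# every column arrives as character
raw <- data.frame(
  PLAYER_NAME = c("Player A", "Player B", "Player C"),
  GP          = c("14", "9", "18"),
  PLUS_MINUS  = c("-0.1", "1.5", "0.3"),
  stringsAsFactors = FALSE
)

# Let R pick a suitable type for each column
converted <- type.convert(raw, as.is = TRUE)
str(converted)

# GP is now a number, so subset() compares numerically
subset(converted, GP > 10)
```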

Convert delimited string to numeric vector in dataframe

This is such a basic question, I'm embarrassed to ask.
Let's say I have a dataframe full of columns which contain data of the following form:
test <-"3000,9843,9291,2161,3458,2347,22925,55836,2890,2824,2848,2805,2808,2775,2760,2706,2727,2688,2727,2658,2654,2588"
I want to convert this to a numeric vector, which I have done like so:
test <- as.numeric(unlist(strsplit(test, split=",")))
I now want to convert a large dataframe containing a column full of this data into a numeric vector equivalent:
mutate(data,
converted = as.numeric(unlist(strsplit(badColumn, split=","))),
)
This doesn't work because presumably it's converting the entire column into a numeric vector and then replacing a single row with that value:
Error in mutate_impl(.data, dots) : Column converted must be
length 20 (the number of rows) or one, not 1274
How do I do this?
Here's some sample data that reproduces your error:
data <- data.frame(a = 1:3,
badColumn = c("10,20,30,40,50", "1,2,3,4,5,6", "9,8,7,6,5,4,3"),
stringsAsFactors = FALSE)
Here's the error:
library(tidyverse)
mutate(data, converted = as.numeric(unlist(strsplit(badColumn, split=","))))
# Error in mutate_impl(.data, dots) :
# Column `converted` must be length 3 (the number of rows) or one, not 18
A straightforward way would be to just use strsplit on the entire column, and lapply ... as.numeric to convert the resulting list values from character vectors to numeric vectors.
x <- mutate(data, converted = lapply(strsplit(badColumn, ",", TRUE), as.numeric))
str(x)
# 'data.frame': 3 obs. of 3 variables:
# $ a : int 1 2 3
# $ badColumn: chr "10,20,30,40,50" "1,2,3,4,5,6" "9,8,7,6,5,4,3"
# $ converted:List of 3
# ..$ : num 10 20 30 40 50
# ..$ : num 1 2 3 4 5 6
# ..$ : num 9 8 7 6 5 4 3
This might help:
library(purrr)
mutate(data, converted = map(badColumn, function(txt) as.numeric(unlist(strsplit(txt, split = ",")))))
What you get is a list column which contains the numeric vectors.
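Once you have the list column, you will usually want to do something with it. Here is a base R sketch using the sample data from the answer above; the column names n and total and the data frame long are made up for illustration:

```r
data <- data.frame(a = 1:3,
                   badColumn = c("10,20,30,40,50", "1,2,3,4,5,6", "9,8,7,6,5,4,3"),
                   stringsAsFactors = FALSE)

# One numeric vector per row, stored as a list column
data$converted <- lapply(strsplit(data$badColumn, ",", fixed = TRUE), as.numeric)

# Per-row summaries of the list column
data$n     <- lengths(data$converted)
data$total <- vapply(data$converted, sum, numeric(1))

# Long format: one row per individual value, keyed by column a
long <- data.frame(a     = rep(data$a, lengths(data$converted)),
                   value = unlist(data$converted))
nrow(long)
```

Whether you keep the list column or go to long format depends on what comes next; long format tends to be easier to group and plot.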
Base R
A=c(as.numeric(strsplit(test,',')[[1]]))
A
[1] 3000 9843 9291 2161 3458 2347 22925 55836 2890 2824 2848 2805 2808 2775 2760 2706 2727 2688 2727 2658 2654 2588
df$NEw2=lapply(df$NEw, function(x) c(as.numeric(strsplit(x,',')[[1]])))
df %>% mutate(NEw2 = lapply(strsplit(NEw, ","), as.numeric)) # split every row, not just the first

Leftovers from sample function [duplicate]

This question already has answers here:
How to split data into training/testing sets using sample function
I have a question about showing the leftovers from the sample function.
For school we had to make a test dataframe and a train dataframe.
The data that I have to validate has only a train dataframe.
The raw dataframe has 2158 observations. They made a train dataframe with 1529 observations.
set.seed(22)
train <- Gary[sample(1:nrow(Gary), 1529,
replace=FALSE),]
train[, 1] <- as.factor(unlist(train[, 1]))
train[, 2:201] <- as.numeric(as.factor(unlist(train[, 2:201])))
Now I want to have the "leftovers" in a different dataframe.
Do some of you know how to do this?
You can use negative indexing in R if you know the training indices. So we only need to rewrite your first lines:
set.seed(22)
train_indices <- sample(1:nrow(Gary), 1529, replace=FALSE)
train <- Gary[train_indices, ]
test <- Gary[-train_indices, ]
# Proceed with rest of script.
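A variant of the same idea is to build a logical mask instead of using negative indices. This is a sketch on a dummy Gary with a single column, since the real data isn't available. Note that, unlike Gary[train_indices, ], the mask keeps the rows in their original order rather than in sampled order:

```r
# Dummy stand-in for the OP's data: 2158 rows, one id column
n <- 2158
Gary <- data.frame(V1 = seq_len(n))

set.seed(22)
train_indices <- sample(seq_len(nrow(Gary)), 1529, replace = FALSE)

# TRUE for rows that were sampled, FALSE for the leftovers
in_train <- seq_len(nrow(Gary)) %in% train_indices

train <- Gary[in_train, , drop = FALSE]
test  <- Gary[!in_train, , drop = FALSE]

nrow(train)  # 1529
nrow(test)   # 629
```

The mask also behaves sensibly when no rows are sampled, whereas Gary[-integer(0), ] returns zero rows instead of all of them.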
This can be done using the setdiff() function.
Edit: Please note that there is another answer by @AlexR using negative indexing, which is much simpler if the indices are only used for subsetting.
However, first we need to create some dummy data, as the OP hasn't provided any data with the question (for future use, please read How to make a great R reproducible example?):
Dummy data
Create dummy data frame with 2158 rows and two columns:
n <- 2158
Gary <- data.frame(V1 = seq_len(n), V2 = sample(LETTERS, n , replace =TRUE))
str(Gary)
#'data.frame': 2158 obs. of 2 variables:
# $ V1: int 1 2 3 4 5 6 7 8 9 10 ...
# $ V2: Factor w/ 26 levels "A","B","C","D",..: 21 11 24 10 5 17 18 1 25 7 ...
Sampled and leftover rows
First, the vectors of sampled and leftover rows are computed, before subsetting Gary in subsequent steps:
set.seed(22)
sampled_rows <- sample(seq_len(nrow(Gary)), 1529, replace=FALSE)
leftover_rows <- setdiff(seq_len(nrow(Gary)), sampled_rows)
train <- Gary[sampled_rows, ]
leftover <- Gary[leftover_rows, ]
str(train)
#'data.frame': 1529 obs. of 2 variables:
# $ V1: int 657 1025 2143 1123 1817 1558 1324 1590 898 801 ...
# $ V2: Factor w/ 26 levels "A","B","C","D",..: 19 16 25 15 2 5 8 14 20 3 ...
str(leftover)
#'data.frame': 629 obs. of 2 variables:
# $ V1: int 2 5 6 7 8 9 10 12 20 24 ...
# $ V2: Factor w/ 26 levels "A","B","C","D",..: 11 5 17 18 1 25 7 25 7 18 ...
leftover is a data frame which contains the rows of Gary which haven't been sampled.
Verification
To verify, we combine train and leftover again and sort the rows to compare with the original data frame:
recombined <- rbind(train, leftover)
identical(Gary, recombined[order(recombined$V1), ])
#[1] TRUE

Remove duplicates in R without converting to numeric

I have 2 variables in a data frame with 300 observations.
$ imagelike: int 3 27 4 5370 ...
$ user: Factor w/ 24915 levels "\"0.1gr\"","\"008bla\"", ..
I then tried to remove the duplicates, such as "- " appears 2 times:
testclean <- data1[!duplicated(data1), ]
This gives me the warning message:
In Ops.factor(left): "-"not meaningful for factors
I then converted it to a matrix:
data2 <- data.matrix(data1)
testclean2 <- data2[!duplicated(data2), ]
This does the trick - however - it converts the userNames to a numeric.
=========================================================================
I am new but I have tried looking at previous posts on this topic (including the one below) but it did not work out:
Convert data.frame columns from factors to characters
Some sample data, from your image (please don't post images of data!):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : Factor w/ 5 levels "\"parmezan_pizza\"",..: 2 5 3 3 4 1
To fix the problem with factors as well as the embedded quotes:
data1$userName <- gsub('"', '', as.character(data1$userName))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "testblabla" "test_00" "frenchfries" "frenchfries" ...
Like @DanielWinkler suggested, if you can change how the data is read in or defined, you might choose to include stringsAsFactors = FALSE (this argument is accepted in many functions, including read.csv, read.table, and most data.frame functions including as.data.frame and rbind):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""),
stringsAsFactors = FALSE)
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "\"testblabla\"" "test_00" "frenchfries" "frenchfries" ...
(Note that this still has embedded quotes, so you'll still need something like data1$userName <- gsub('"', '', data1$userName).)
Now, we have data that looks like this:
data1
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 4 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
and your need to remove duplicates works:
data1[! duplicated(data1), ]
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
Try
data$userName <- as.character(data$userName)
And then
data<-unique(data)
You could also pass the argument stringsAsFactors = FALSE when reading the data. This is usually a good idea.
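To tie the two answers together, here is a short sketch on the six-row sample data from above. It compares removing fully duplicated rows with removing rows that only repeat a userName; the names full_dedup and by_user are made up for illustration:

```r
data1 <- data.frame(imageLikeCount = c(3, 27, 4, 4, 16, 103),
                    userName = c("\"testblabla\"", "test_00", "frenchfries",
                                 "frenchfries", "test.inc", "\"parmezan_pizza\""),
                    stringsAsFactors = FALSE)

# Strip the embedded quotes; userName stays character throughout
data1$userName <- gsub('"', '', data1$userName, fixed = TRUE)

# Drop rows that are duplicated across all columns
full_dedup <- unique(data1)

# Drop rows that repeat a userName, whatever the like count
by_user <- data1[!duplicated(data1$userName), ]

nrow(full_dedup)
nrow(by_user)
```

On this sample both give the same five rows. They would differ if the same user appeared with two different like counts.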

Reverting to Factor Codes R

Let's say I have a data.frame that looks like this:
df.test <- data.frame(1:26, 1:26)
colnames(df.test) <- c("a","b")
and I apply a factor:
df.test$a <- factor(df.test$a, levels=c(1:26), labels=letters)
Now I would like to convert it back to the integer codes:
as.numeric(df.test[1]) ## returns an error
But this works:
as.numeric(df.test$a)
Why is that?
Actually, Joshua's links are not applicable here because the task is not converting from a factor whose levels have a numeric interpretation. Your original effort that produced an error was almost correct. It was missing only a comma before the 1:
df.test <- data.frame(1:26, 1:26)
colnames(df.test) <- c("a","b")
df.test$a <- factor(df.test$a, levels=c(1:26), labels=letters)
as.numeric(df.test[,1])
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
# [19] 19 20 21 22 23 24 25 26
Or you could have used "[["
> as.numeric(df.test[[1]])
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26
as.numeric will convert a factor to numeric:
as.numeric(df.test$a)
Accessing a column by name gives you a factor vector, which can be converted to numeric.
However, a data frame is a list (of columns), and when you use the single bracket operator and a single number on a list, you get a list of length one. The same applies for data frames, so df.test[1] gets you column one as a new data frame, which cannot be coerced by as.numeric(). I did not know this!
> str(df.test$a)
Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
> str(df.test[1])
'data.frame': 26 obs. of 1 variable:
$ a: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
To respond to your edit: Keep in mind that a factor has two parts: 1) the labels, and 2) the underlying integer codes. The two answers I linked to in my comment were to convert the labels to numeric. If you just want to get the internal codes, use as.integer(df.test$a) as demonstrated in the examples section of ?factor. aL3xa answered your question about why as.numeric(df.test[1]) throws an error.
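Putting the pieces of this thread side by side, here is a minimal sketch using the df.test from the question. The variable names res, codes and labels are only for illustration:

```r
df.test <- data.frame(a = 1:26, b = 1:26)
df.test$a <- factor(df.test$a, levels = 1:26, labels = letters)

class(df.test[1])    # "data.frame": still a list holding one column
class(df.test[[1]])  # "factor": the column itself
class(df.test[, 1])  # "factor": same column via matrix-style indexing

# as.numeric() on the one-column data frame fails; catch the error to inspect it
res <- tryCatch(as.numeric(df.test[1]), error = function(e) e)
inherits(res, "error")  # TRUE

# The two parts of a factor: integer codes and labels
codes  <- as.integer(df.test$a)
labels <- as.character(df.test$a)
head(codes, 3)   # 1 2 3
head(labels, 3)  # "a" "b" "c"
```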
