Leftovers from sample function [duplicate] - r

This question already has answers here:
How to split data into training/testing sets using sample function
(28 answers)
Closed 6 years ago.
I have a question about show leftovers from sample function.
For school we had to make a test dataframe and a train dataframe.
The data that I have to validate has only a train dataframe.
The raw dataframe has 2158 observations. They made a train dataframe with 1529 observations.
set.seed(22)
train <- Gary[sample(1:nrow(Gary), 1529,
replace=FALSE),]
train[, 1] <- as.factor(unlist(train[, 1]))
train[, 2:201] <- as.numeric(as.factor(unlist(train[, 2:201])))
Now I want to have the "leftovers" in a different dataframe.
Do some of you know how to do this?

You can use negative indexing in R if you know the training indices. So we only need to rewrite your first lines:
set.seed(22)
train_indices <- sample(1:nrow(Gary), 1529, replace=FALSE)
train <- Gary[train_indices, ]
test <- Gary[-train_indices, ]
# Proceed with rest of script.

This can be done using the setdiff() function.
Edit: Please note that there is another answer by #AlexR using negative indexing which is much simpler if the indices are only used for subsetting.
However, first we need to create some dummy data as ther OP hasn't provided any data with the question (For future use, please read How to make a great R reproducible example?):
Dummy data
Create dummy data frame with 2158 rows and two columns:
n <- 2158
Gary <- data.frame(V1 = seq_len(n), V2 = sample(LETTERS, n , replace =TRUE))
str(Gary)
#'data.frame': 2158 obs. of 2 variables:
# $ V1: int 1 2 3 4 5 6 7 8 9 10 ...
# $ V2: Factor w/ 26 levels "A","B","C","D",..: 21 11 24 10 5 17 18 1 25 7 ...
Sampled and leftover rows
First, the vectors of sampled and leftover rows are computed, before subsetting Gary in subsequent steps:
set.seed(22)
sampled_rows <- sample(seq_len(nrow(Gary)), 1529, replace=FALSE)
leftover_rows <- setdiff(seq_len(nrow(Gary)), selected_rows)
train <- Gary[sampled_rows, ]
leftover <- Gary[leftover_rows, ]
str(train)
#'data.frame': 1529 obs. of 2 variables:
# $ V1: int 657 1025 2143 1123 1817 1558 1324 1590 898 801 ...
# $ V2: Factor w/ 26 levels "A","B","C","D",..: 19 16 25 15 2 5 8 14 20 3 ...
str(leftover)
#'data.frame': 629 obs. of 2 variables:
# $ V1: int 2 5 6 7 8 9 10 12 20 24 ...
# $ V2: Factor w/ 26 levels "A","B","C","D",..: 11 5 17 18 1 25 7 25 7 18 ...
leftover is a data frame which contains the rows of Gary which haven't been sampled.
Verification
To verify, we combine train and leftover again and sort the rows to compare with the original data frame:
recombined <- rbind(train, leftover)
identical(Gary, recombined[order(recombined$V1), ])
#[1] TRUE

Related

Convert comma separated decimals from character to numeric

For my exam i have to build some scatter plots in r. I created a data frame with 4 variables. with this data frame i want to add regression lines to my scatter plots.
the name of my data frame is "alle".
variable names are: demo, tot, besch, usd
with this code i tried to line the regression line but got following result:
reg1<- lm(tot~demo, data=alle)
Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
here is the structure of "alle"
str(alle)
'data.frame': 11 obs. of 4 variables:
$ demo : chr "498.300.775" "500.297.033" "502.090.235" "503.170.618" ...
$ tot : Factor w/ 11 levels "4.846.423","4.871.049",..: 1 3 4 5 2 8 7 6 10 9 ...
$ besch: Factor w/ 9 levels "68,4","68,6",..: 5 7 3 2 2 1 1 4 6 8 ...
$ usd : Factor w/ 44 levels "0,68434","0,72584",..: 26 30 29 23 28 22 24 25 15 14 ...
Tried to convert column "demo" to numeric with
alle$demo <- as.numeric(as.character(alle$demo))
it converted the column to numeric but now the rows are full with "NA"s.
I think that i all columns must be numeric.
How can I convert all 4 columns to numeric and finally plot the regression lines.
Data:
> head(alle,6)
demo tot besch usd
1 498.300.775 4.846.423 69,8 1,3705
2 500.297.033 4.891.934 70,3 1,4708
3 502.090.235 4.901.358 69,0 1,3948
4 503.170.618 4.906.313 68,6 1,3257
5 502.964.837 4.871.049 68,6 1,3920
6 504.047.964 5.010.371 68,4 1,2848
thanks
Try doing it in two steps. First get rid of the dots, then replace the commas by decimal points and coerce to numeric.
alle[] <- lapply(alle, function(x) gsub("\\.", "", x))
alle[] <- lapply(alle, function(x) as.numeric(sub(",", ".", x)))
Note:
The above solution is broken in two for readability. The following does the same but it takes just one lapply loop and should therefore be faster if the dataset is big. If the dataset is small to medium, maybe the two steps solutions is preferable.
alle[] <- lapply(alle, function(x){
as.numeric(sub(",", ".", gsub("\\.", "", x)))
})
With dplyr:
library(dplyr)
alle %>%
mutate_all(as.character) %>%
mutate_at(c("besch","usd"),function(x) as.numeric(as.character(gsub(",",".",x)))) ->alle
demo tot besch usd
1 498.300.775 4.846.423 69.8 1.3705
2 500.297.033 4.891.934 70.3 1.4708
3 502.090.235 4.901.358 69.0 1.3948
4 503.170.618 4.906.313 68.6 1.3257
5 502.964.837 4.871.049 68.6 1.3920
6 504.047.964 5.010.371 68.4 1.2848

Remove duplicates in R without converting to numeric

I have 2 variables in a data frame with 300 observations.
$ imagelike: int 3 27 4 5370 ...
$ user: Factor w/ 24915 levels "\"0.1gr\"","\"008bla\"", ..
I then tried to remove the duplicates, such as "- " appears 2 times:
testclean <- data1[!duplicated(data1), ]
This gives me the warning message:
In Ops.factor(left): "-"not meaningful for factors
I have then converted it to a maxtrix:
data2 <- data.matrix(data1)
testclean2 <- data2[!duplicated(data2), ]
This does the trick - however - it converts the userNames to a numeric.
=========================================================================
I am new but I have tried looking at previous posts on this topic (including the one below) but it did not work out:
Convert data.frame columns from factors to characters
Some sample data, from your image (please don't post images of data!):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : Factor w/ 5 levels "\"parmezan_pizza\"",..: 2 5 3 3 4 1
To fix the problem with factors as well as the embedded quotes:
data1$userName <- gsub('"', '', as.character(data1$userName))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "testblabla" "test_00" "frenchfries" "frenchfries" ...
Like #DanielWinkler suggested, if you can change how the data is read-in or defined, you might choose to include stringsAsFactors = FALSE (this argument is accepted in many functions, including read.csv, read.table, and most data.frame functions including as.data.frame and rbind):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""),
stringsAsFactors = FALSE)
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "\"testblabla\"" "test_00" "frenchfries" "frenchfries" ...
(Note that this still has embedded quotes, so you'll still need something like data1$userName <- gsub('"', '', data1$userName).)
Now, we have data that looks like this:
data1
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 4 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
and your need to remove duplicates works:
data1[! duplicated(data1), ]
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
Try
data$userName <- as.character(data$userName)
And then
data<-unique(data)
You could also pass the argument stringAsFactor = FALSE when reading the data. This is usually a good idea.

Ranking entries in a column based on sums of entries in another column

everyone. I am a beginner in R with a question I can't quite figure out. I've created multiple queries within Stack Overflow to address my question (links to results here, here, and here) but none have addressed my issue. On to the problem: I have subset data frame DAV from a larger dataset.
> str(DAV)
'data.frame': 994 obs. of 9 variables:
$ MIL.ID : Factor w/ 18840 levels "","0000151472",..: 7041 9258 10513 5286 5759 5304 5312 5337 5337 5547 ...
$ Name : Factor w/ 18395 levels ""," Atticus Finch",..: 1226 6754 12103 17234 2317 14034 15747 4542 4542 14819 ...
$ Center : int 2370 2370 2370 2370 2370 2370 2370 2370 2370 2370 ...
$ Gift.Date : Factor w/ 339 levels "","01/01/2015",..: 6 6 6 7 10 13 13 13 13 13 ...
$ Gift.Amount: num 100 47.5 150 41 95 ...
$ Solic. : Factor w/ 31 levels "","aa","ac","an",..: 20 31 20 29 20 8 28 8 8 8 ...
$ Tender : Factor w/ 10 levels "","c","ca","cc",..: 3 2 3 5 2 9 3 9 9 9 ...
$ Account : Factor w/ 16 levels "","29101-0000",..: 4 4 4 11 2 11 2 11 2 11 ...
$ Restriction: Factor w/ 258 levels "","AAU","ACA",..: 216 59 216 1 137 1 137 1 38 1 ...
The two relevant columns for my issue are MIL.ID, which contains a unique ID for a donor, and Gift.Amount, which contains a dollar amount for a single gift the donor gave. A single MIL.ID is often associated with multiple Gift.Amount entries, meaning that donor has given on multiple different occasions for various amounts. Here is what I want to do:
I want to separate out the above mentioned columns from the rest of the data frame;
I want to sum(Gift.Amount) but only do so for each donor, i.e. I want to create a sum of all gifts for MIL.ID 1234 in the above data.frame; and
I want to rank all the MIL.IDs based on the sum Gift.Amount entries associated with their ID.
I apologize for how basic this is, and if it is redundant to a question already asked, but I couldn't find anything.
Edit to address comment:
shot of table
> print(ranking)
Desired output
I am struggling to get the formatting correct here so I included screen shots
This should do it:
df <- DAV[, c("MIL.ID", "Gift.Amount")] #extract columns
df <- aggregate(Gift.Amount ~ MIL.ID, df, sum) #sum amounts with same ID
df <- df[ order(df$Gift.Amount,decreasing = TRUE), ] #sort Decreasing

Graphing data that is read using readHTMLTable

I want to read the following table , from a webpage then create a bargraph.
Language............ Jobs
PHP.................... 12,664
Java................... 12,558
Objective C......... 8,925
SQL.................... 5,165
Android (Java).... 4,981
Ruby................... 3,859
JavaScript........... 3,742
C#....................... 3,549
C++..................... 1,908
ActionScript......... 1,821
Python................. 1,649
C.......................... 1,087
ASP.NET............... 818
My questions:
1.The problem that my bars get messed up and each bar does correspond to the correct language
The following is my code:
library(XML)
tables2 <-(readHTMLTable("http://www.sitepoint.com/best-programming-language-of-2013/",which=1))
barplot(as.numeric(tables2$Job),names.arg=tables2$Language)
Since I am a beginner at R I would like to know in what format does readHTMLTable save the data in? is it a matrix, data frame or other format?
The main problem here is that Jobs is being read as a factor. Because of the commas in that field, you can't do a direct numeric conversion. You can find out what 'format' your object is in R by doing str(). Here str(tables2) gives:
'data.frame': 13 obs. of 2 variables:
$ Language: Factor w/ 13 levels "ActionScript",..: 10 7 9 13 2 12 8 5 6 1 ...
$ Jobs : Factor w/ 13 levels "1,087","1,649",..: 6 5 12 11 10 9 8 7 4 3 ...
So you can see Jobs is a factor, and that tables2 is a data.frame. To convert it to numeric you need to remove the commas. You can do that with gsub().
tables2$Jobs <- as.numeric(gsub(",","",tables2$Jobs))
No str(tables2) gives:
'data.frame': 13 obs. of 2 variables:
$ Language: Factor w/ 13 levels "ActionScript",..: 10 7 9 13 2 12 8 5 6 1 ...
$ Jobs : num 12664 12558 8925 5165 4981 ...
and when you do your plot, all should be well:
barplot(tables2$Jobs,names.arg=tables2$Language)

Reverting to Factor Codes R

Let's say I have a data.frame that looks like this:
df.test <- data.frame(1:26, 1:26)
colnames(df.test) <- c("a","b")
and I apply a factor:
df.test$a <- factor(df.test$a, levels=c(1:26), labels=letters)
Now, how I would like to convert it back the integer codes:
as.numeric(df.test[1])## replies with an error code.
But this works:
as.numeric(df.test$a)
Why is that?
Actually Joshua's link are not applicable here because the task is not coverting from a factor with levels that have numeric interpretation. Your original effort that produced an error was almost correct. It was missing only a comma before the 1:
df.test <- data.frame(1:26, 1:26)
colnames(df.test) <- c("a","b")
df.test$a <- factor(df.test$a, levels=c(1:26), labels=letters)
as.numeric(df.test[,1])
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
# [19] 19 20 21 22 23 24 25 26
Or you could have used "[["
> as.numeric(df.test[[1]])
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26
as.numeric will convert a factor to numeric:
as.numeric(df.test$a)
Accessing a column by name gives you a factor vector, which can be converted to numeric.
However, a data frame is a list (of columns), and when you use the single bracket operator and a single number on a list, you get a list of length one. The same applies for data frames, so df.test[1] gets you column one as a new data frame, which cannot be coerced by as.numeric(). I did not know this!
> str(df.test$a)
Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
> str(df.test[1])
'data.frame': 26 obs. of 1 variable:
$ a: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
To respond to your edit: Keep in mind that a factor has two parts: 1) the labels, and 2) the underlying integer codes. The two answers I linked to in my comment were to convert the labels to numeric. If you just want to get the internal codes, use as.integer(df.test$a) as demonstrated in the examples section of ?factor. aL3xa answered your question about why as.numeric(df.test[1]) throws an error.

Resources