R read.table vs read.csv

grades <- read.table("studentgrades.csv",header = TRUE,row.names="StudentID", sep = ",")
gradess <- read.csv("studentgrades.csv",header = TRUE,row.names="StudentID", sep = ",")
The result of read.table is:
grades
[1] First Last Math Science Social.Studies
<0 rows> (or 0-length row.names)
The result of read.csv is:
gradess
First Last Math Science Social.Studies
11 Bob Smith 90 80 67
12 Jane Weary 75 NA 80
10 Dan "Thornton" 65 75 70
40 Mary O'Leary 90 95 92
I don't understand why read.table does not give me the right result.

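The problem is caused by the apostrophe (') in O'Leary in the Last column. By default, read.table treats both " and ' as quote characters (quote = "\"'"), whereas read.csv only treats " as one (quote = "\""). read.table therefore reads the apostrophe as the start of a quoted string. To get the desired result, you need to change the quote option in read.table.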
If you use quote=NULL in read.table like below
grades <- read.table("studentgrades.csv",header = TRUE,sep=",",quote=NULL,row.names="StudentID")
Then you get the desired result.
> grades
First Last Math Science Social.Studies
11 Bob Smith 90 80 67
12 Jane Weary 75 NA 80
10 Dan "Thornton" 65 75 70
40 Mary O'Leary 90 95 92
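A small self-contained check of the quoting behaviour, assuming a made-up three-row file written to a temporary path (the file contents and the names csv_path, fixed and via_csv are illustrative, not the asker's data). Restricting quote to the double quote is what read.csv does by default, so it is an alternative to disabling quoting altogether:

```r
# Hypothetical sample file with an apostrophe in one last name
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "StudentID,First,Last,Math",
  "11,Bob,Smith,90",
  "40,Mary,O'Leary,90",
  "12,Jane,Weary,75"
), csv_path)

# Only the double quote is treated as a quote character here, so the
# apostrophe in O'Leary is read as ordinary text
fixed <- read.table(csv_path, header = TRUE, sep = ",",
                    quote = "\"", row.names = "StudentID")

# read.csv already uses quote = "\"" by default
via_csv <- read.csv(csv_path, row.names = "StudentID")

print(fixed)
```

Keeping the double quote as a quote character still allows fields that contain commas to be quoted, which quote = NULL would not.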

Create a new column in tibble with percentage of total per year

I've successfully transformed the first Tibble to the second one as shown below:
1.
# Animal Food 2015 2016
Monkey Banana 54 65
Monkey Hotdog 43 76
## # ... with 54 more rows
2.
# Animal Year Banana Hotdog
Monkey 2015 54 43
Monkey 2016 65 76
## # ... with 54 more rows
Now I would like to create a new column showing the percentage of Hotdogs, using this code:
df$hotdog_percent <- with(df, "Hotdog" / ( "Hotdog" + "Banana") )
However, I get the error "non-numeric argument to binary operator". I've tried the code below to convert the original columns to numeric, without success.
df$Banana <- as.numeric(as.character(df$Banana)) %>%
df$Hotdog <- as.numeric(as.character(df$Hotdog))
What am I supposed to do?
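Try this without quoting the column names inside with(). A quoted name such as "Hotdog" is a character string, not the column, which is why the division fails.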
df$Banana <- as.numeric(df$Banana)
df$Hotdog <- as.numeric(df$Hotdog)
df$hotdog_percent <- with(df, Hotdog / (Hotdog + Banana) )
output
Animal Year Banana Hotdog hotdog_percent
1 Monkey 2015 54 43 0.4432990
2 Monkey 2016 65 76 0.5390071
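The same fix as a self-contained sketch, assuming a data frame rebuilt by hand from the two rows shown in the question (quoted_fails is a hypothetical helper name used only to demonstrate the error):

```r
# Two rows copied from the question
df <- data.frame(
  Animal = c("Monkey", "Monkey"),
  Year = c(2015, 2016),
  Banana = c(54, 65),
  Hotdog = c(43, 76)
)

# Unquoted names are looked up as columns of df
df$hotdog_percent <- with(df, Hotdog / (Hotdog + Banana))

# A quoted name is only a character string, so arithmetic on it fails
quoted_fails <- inherits(try("Hotdog" / 2, silent = TRUE), "try-error")

print(df)
```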

Questions about how to divide and find averages of a dataset

Let's say I have a dataset where I have a list of names and their ages
Tom 65
Sam 40
Sue 88
Kay 4
Jon 25
Lia 85
Ian 39
Joe 10
Bea 17
Jan 43
Jen 17
Ike 24
Jay 35
Cam 77
Jin 12
Ron 1
Ray 45
Leo 29
Ken 98
Mel 56
Amy 49
Joy 67
Ivy 3
Noe 14
Max 31
Jax 61
Lee 19
Ace 28
Ben 5
Guy 74
I'm trying to divide the dataset into ten equal bins in descending order (e.g. the first bin will have Ken, Sue, and Lia and the last bin will have Ben, Ivy, and Ron), and I want to find the average age for each bin (so the average age for the first bin would be 90.33). I was able to do this in MS Excel quite easily, but I'm not sure how to do it efficiently in R. Any suggestions?
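We can use cut to create a grouping variable and then summarise by taking the mean. Here df1 is taken to hold the names in column v1 and the ages in column v2. Note that cut(v2, breaks = 10) makes ten intervals of equal width over the range of v2, so the groups need not contain the same number of people.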
library(dplyr)
df1 %>%
group_by(grp = cut(v2, breaks = 10)) %>%
summarise(v1 = list(v1), v2 = mean(v2))
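If bins with the same number of people are what is wanted, one possible base R sketch is to sort by age and assign bin numbers by position. The names people, bin and bin_means are made up for this illustration, and rep(..., each = ) assumes the number of rows is divisible by ten:

```r
# Names and ages from the question
people <- data.frame(
  name = c("Tom", "Sam", "Sue", "Kay", "Jon", "Lia", "Ian", "Joe", "Bea", "Jan",
           "Jen", "Ike", "Jay", "Cam", "Jin", "Ron", "Ray", "Leo", "Ken", "Mel",
           "Amy", "Joy", "Ivy", "Noe", "Max", "Jax", "Lee", "Ace", "Ben", "Guy"),
  age = c(65, 40, 88, 4, 25, 85, 39, 10, 17, 43,
          17, 24, 35, 77, 12, 1, 45, 29, 98, 56,
          49, 67, 3, 14, 31, 61, 19, 28, 5, 74)
)

# Sort by age, oldest first
people <- people[order(people$age, decreasing = TRUE), ]

# Bin 1 holds the three oldest people, bin 10 the three youngest
people$bin <- rep(1:10, each = nrow(people) / 10)

# Average age per bin
bin_means <- tapply(people$age, people$bin, mean)
print(round(bin_means, 2))
```

The first bin contains Ken, Sue and Lia, matching the 90.33 average mentioned in the question.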

Convert row names into new columns in a data frame

Apologies in advance if this has already been asked elsewhere, but I've tried different attempts and nothing has worked so far.
In my data frame Mesure, I would like to split the values of the column Row.names into two new columns named Sample_type and Locality. I tried a tidyverse solution, but R reports that the column name must not be duplicated. How can I fix it? Also, is it possible to remove the "<"?
> head(Mesure)
Row.names mean_Mesure max_Mesure min_Mesure
1 Aquatic_moss.Paris.AG-110m.< 100 110 90
2 Aquatic_moss.Paris.BE-7. 123 177 53
3 Aquatic_moss.Paris.CO-57.< 40 60 20
4 Aquatic_moss.Paris.CO-58.< 40 50 30
5 Aquatic_moss.Paris.CO-60.< 50 70 30
6 Aquatic_moss.Paris.CS-134.< 200 300 100
>
> library(tidyverse)
> new_df <- Mesure %>%
+ rownames_to_column(var = "Row.names") %>%
+ separate(Row.names,sep = ".",into = c("Sample_type","Locality"))
Error: Column name `Row.names` must not be duplicated.
Run `rlang::last_error()` to see where the error occurred.
To split at the first dot only, you can use:
Mesure %>%
separate(Row.names, sep = "\\.", into = c("Sample_type", "Locality"), extra = "merge")
Explanation:
You don't need rownames_to_column(), because "Row.names" is already a column; adding it a second time is what triggers the duplicated-column error.
sep = "." is not enough, because sep is interpreted as a regular expression, in which . matches any character. Escape it as "\\." to match a literal dot.
There are several . in the column, so you need to specify extra = "merge" to split only at the first occurrence. If you would like to keep only "Paris" without AG-110m etc., specify extra = "drop" instead.
Result with extra = "merge":
Sample_type Locality mean_Mesure max_Mesure min_Mesure
1 Aquatic_moss Paris.AG-110m.< 100 110 90
2 Aquatic_moss Paris.BE-7. 123 177 53
3 Aquatic_moss Paris.CO-57.< 40 60 20
4 Aquatic_moss Paris.CO-58.< 40 50 30
5 Aquatic_moss Paris.CO-60.< 50 70 30
6 Aquatic_moss Paris.CS-134.< 200 300 100
Result with extra = "drop":
Sample_type Locality mean_Mesure max_Mesure min_Mesure
1 Aquatic_moss Paris 100 110 90
2 Aquatic_moss Paris 123 177 53
3 Aquatic_moss Paris 40 60 20
4 Aquatic_moss Paris 40 50 30
5 Aquatic_moss Paris 50 70 30
6 Aquatic_moss Paris 200 300 100
If you need to drop "<" at the end of Locality column, run something like:
Mesure$Locality <- gsub("<$", "", Mesure$Locality)
where "<$" means "< at the end of the string".
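The regular-expression points above can be checked in base R without tidyr. This is only an illustrative sketch on two values copied from the question; the variable names x, sample_type, locality and locality_clean are made up here:

```r
x <- c("Aquatic_moss.Paris.AG-110m.<", "Aquatic_moss.Paris.BE-7.")

# "\\." matches a literal dot; fixed = TRUE is another way to get the same split
strsplit(x[1], "\\.")[[1]]
strsplit(x[1], ".", fixed = TRUE)[[1]]

# Split at the first dot only: everything before it, and everything after it
sample_type <- sub("\\..*$", "", x)
locality <- sub("^[^.]*\\.", "", x)

# Remove a "<" that sits at the end of the string
locality_clean <- sub("<$", "", locality)

print(data.frame(sample_type, locality, locality_clean))
```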
Apologies, I should have read your question properly. The answer to the second part would be:
d %>% separate(Row.names, into=c("Sample_type","Locality"), extra="drop")
# A tibble: 6 x 6
Sample_type Locality mean_Mesure max_Mesure min_Mesure
<chr> <chr> <dbl> <dbl> <dbl>
1 Aquatic moss 100 110 90
2 Aquatic moss 123 177 53
3 Aquatic moss 40 60 20
4 Aquatic moss 40 50 30
5 Aquatic moss 50 70 30
6 Aquatic moss 200 300 100
I can't help you with the first part because I don't know how you create the input data frame.

R Extract names from text

I'm trying to extract a list of rugby players' names from a string. The string contains all of the information from a table: the headers (team names) as well as the name of the player in each position for each team. It also has the player rankings, but I don't care about those.
Important: a lot of player rankings are missing. I found a solution, but it doesn't handle missing rankings (in the example below, Rabah Slimani is the first player without a recorded ranking).
Note that the numbers 1-15 indicate positions, and there are always two names following each position (home player and away player).
Here's the sample string:
" Team Sheets # FRA France RPI IRE Ireland RPI 1 Jefferson Poirot 72 Cian Healy 82 2 Guilhem Guirado 78 Rory Best 85 3 Rabah Slimani Tadhg Furlong 85 4 Arthur Iturria 82 Iain Henderson 84 5 Sebastien Vahaamahina 84 James Ryan 92 6 Wenceslas Lauret 82 Peter O'Mahony 93 7 Yacouba Camara 70 Josh van der Flier 64 8 Kevin Gourdon CJ Stander 91 9 Maxime Machenaud Conor Murray 87 10 Matthieu Jalibert Johnny Sexton 90 11 Virimi Vakatawa Jacob Stockdale 89 12 Henry Chavancy Bundee Aki 83 13 Rémi Lamerat Robbie Henshaw 78 14 Teddy Thomas Keith Earls 89 15 Geoffrey Palis Rob Kearney 80 Substitutes # FRA France RPI IRE Ireland RPI 16 Adrien Pelissie Sean Cronin 84 17 Dany Priso 70 Jack McGrath 70 18 Cedate Gomes Sa 71 John Ryan 86 19 Paul Gabrillagues 77 Devin Toner 90 20 Marco Tauleigne Dan Leavy 80 21 Antoine Dupont 92 Luke McGrath 22 Anthony Belleau 65 Joey Carbery 86 23 Benjamin Fall Fergus McFadden "
Note - it comes from here: https://www.rugbypass.com/live/six-nations/france-vs-ireland-at-stade-de-france-on-03022018/2018/info/
So basically what I want is just the list of names with the team names as the headers e.g.
France Ireland
Jefferson Poirot Cian Healy
Guilhem Guirado Rory Best
... ...
Any help would be much appreciated!
I tried this in an advanced text editor: find every occurrence of two consecutive numbers and replace it with a new line. The regex is
\d+\s+\d+
Once you are done replacing, you will be left with two names on each line, separated by a number. Then use the regex below to replace that number with a single tab
\s+\d+\s+
Hope that helps
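The two regex steps can be sketched in R roughly as follows. This is only an illustration on a short excerpt of the sample string; sheet, rows, pairs and players are made-up names, and the sketch assumes both rankings are present in every row. A row with a missing ranking, such as Rabah Slimani's, has no number between the two names and would need extra handling.

```r
# Short excerpt of the sample string (rows where both rankings are present)
sheet <- "1 Jefferson Poirot 72 Cian Healy 82 2 Guilhem Guirado 78 Rory Best 85"

# Step 1: a ranking followed by the next position number marks a row break
rows <- strsplit(sheet, "\\d+\\s+\\d+")[[1]]

# Drop the leading position number and the trailing ranking left at the edges
rows <- sub("^\\s*\\d+\\s+", "", rows)
rows <- sub("\\s+\\d+\\s*$", "", rows)
rows <- trimws(rows)

# Step 2: the remaining number separates the home and away player
pairs <- strsplit(rows, "\\s+\\d+\\s+")
players <- do.call(rbind, pairs)
colnames(players) <- c("France", "Ireland")
print(players)
```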

R how to avoid "for" when I want to go through dataframe

Here is a brief example.
I have a data frame data1.
name<-c("John","John","Mike","Amy".....)
nationality<-c("Canada","America","Spain","Japan".....)
data1<-data.frame(name,nationality....)
This means the people are from different countries.
Each person is uniquely identified by name and country, with no repeats.
The second data frame is:
name2<-c("John","John","Mike","John",......)
nationality2<-c("Canada","Canada","Canada".....)
score<-c(87,67,98,78,56......)
data2<-data.frame(name2,nationality2,score)
Every person is guaranteed to have 5 rows in data2, i.e. 5 scores, but the rows are in random order.
What I want is every person's 5 scores; I don't care what the person's name is or where they are from.
The final data frame I want to have is
score1 score2 score3 score4 score5
1 89 89 87 78 90
2 ...
3 ...
Every row holds one person's 5 scores, but I don't care who the person is.
My data set is very large, so I cannot use a for loop.
What can I do?
Although there is already an accepted answer that uses base R, I would like to suggest a solution that uses the convenient dcast() function for reshaping from long to wide form instead of tapply() and repeated calls to rbind():
library(data.table) # CRAN version 1.10.4 used
dcast(setDT(data2)[setDT(data1), on = c(name2 = "name", nationality2 = "nationality")],
name2 + nationality2 ~ paste0("score", rowid(rleid(name2, nationality2))),
value.var = "score")
returns
name2 nationality2 score1 score2 score3 score4 score5
1: Amy Canada 93 91 73 8 79
2: John America 3 77 69 89 31
3: Mike Canada 76 92 46 47 75
It seems to me that's what you're asking:
data1 <- data.frame(name = c("John","Mike","Amy"),
nationality = c("America","Canada","Canada"))
data2 <- data.frame(name2 = rep(c("John","Mike","Amy","Jack","John"),each = 5),
score = sample(100,25), nationality2 =rep(c("America","Canada","Canada","Canada","Canada"),each = 5))
data3 <- merge(data2,data1,by.x=c("name2","nationality2"),by.y=c("name","nationality"))
data3$name_country <- paste(data3$name2,data3$nationality2)
all_scores_list <- tapply(data3$score,data3$name_country,c)
as.data.frame(do.call(rbind,all_scores_list))
# V1 V2 V3 V4 V5
# Amy Canada 57 69 90 81 50
# John America 4 92 75 15 2
# Mike Canada 25 86 51 20 12
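For a result that can be checked by hand, here is a loop-free base R sketch with fixed scores instead of sample(). The data are invented for the illustration (scores 1 to 15, three people with interleaved rows), and the rbind step assumes every kept person has exactly 5 scores:

```r
# Hypothetical data: data1 lists the people of interest
data1 <- data.frame(name = c("John", "Mike"),
                    nationality = c("Canada", "Spain"))
# data2 holds 5 scores per person, interleaved; "John America" is not in data1
data2 <- data.frame(
  name2 = rep(c("John", "Mike", "John"), times = 5),
  nationality2 = rep(c("Canada", "Spain", "America"), times = 5),
  score = 1:15
)

# One key per person: name plus nationality
key1 <- paste(data1$name, data1$nationality)
key2 <- paste(data2$name2, data2$nationality2)

# Keep only people listed in data1, then group their scores by key
keep <- key2 %in% key1
by_person <- split(data2$score[keep], key2[keep])

# Each list element becomes one row of the wide result
wide <- as.data.frame(do.call(rbind, by_person))
names(wide) <- paste0("score", seq_len(ncol(wide)))
rownames(wide) <- NULL
print(wide)
```

split() and %in% are vectorised, so no explicit for loop over people is needed.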
