I have a data frame of names (df) as follows.
ID name
1 Xiaoao
2 Yukata
3 Kim
4 ...
The API returns output like this:
European-SouthSlavs,0.2244 Muslim-Pakistanis-Bangladesh,0.0000 European-Italian-Italy,0.0061 ...
I would like to add new columns using an API that returns scores for up to 39 nationalities, and list the top 3 scores per name. My desired outcome is as follows.
ID name score nat
1 Xiaoao 0.7361 Chinese
1 Xiaoao 0.1721 Korean
1 Xiaoao 0.0721 Japanese
2 Yukata 0.8121 Japanese
2 Yukata 0.0811 Chinese
2 Yukata 0.0122 Korean
3 Kim 0.6532 Korean
3 Kim 0.2182 Chinese
3 Kim 0.0981 Japanese
4 ... ... ...
Below is my attempt so far, but it fails to produce the desired outcome because of a number of errors.
df_result <- purrr::map_dfr(df$name, function(name) {
result <- GET(paste0("http://www.name-prism.com/api_token/nat/csv/",
"API TOKEN","/", URLencode(df$name)))
if(http_error(result)){
NULL
}else{
nat<- content(result, "text")
nat<- do.call(rbind, strsplit(strsplit(nat, split = "(?<=\\d)\n", perl=T)[[1]],","))
#first three nationalities
top_nat <- nat[order(as.numeric(nat[,2]), decreasing = T)[1:3],]
c(df$name,as.vector(t(top_nat)))
}
})
First, the top scores were computed over the entire data rather than per name.
Second, I got the error "Error in dplyr::bind_rows():! Argument 1 must have names."
Any comments on my code would be appreciated!
Thank you in advance.
Each iteration of map_dfr should return a data frame, so that the results can be bound together by row:
library(tidyverse)
library(httr)
df <- data.frame(name = c("Xiaoao", "Yukata", "Kim"))
map_dfr(df$name, function(name) {
data.frame(name = name, score = sample(1:10, 1))
})
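Instead of concatenating name with top_nat at the end of your function, you should be making it a data.frame! Also use the function argument name rather than df$name inside the function; otherwise each iteration works on the whole column instead of a single name.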
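To make the suggestion concrete, here is a minimal sketch of a per-name helper that returns a data frame. It does not call the API: fake_response is a made-up string, and it assumes the real response has one "label,score" pair per line, as the strsplit call in the question implies. The helper name top_scores is hypothetical.

```r
# Hypothetical response text in "label,score" format, one pair per line.
# The values are invented for illustration and do not come from the real API.
fake_response <- paste(
  "Chinese,0.7361",
  "Korean,0.1721",
  "Japanese,0.0721",
  "European-Italian-Italy,0.0061",
  sep = "\n"
)

# Parse one response and return its n highest-scoring rows as a data frame.
top_scores <- function(name, response_text, n = 3) {
  scores <- read.csv(text = response_text, header = FALSE,
                     col.names = c("nat", "score"),
                     stringsAsFactors = FALSE)
  scores <- scores[order(scores$score, decreasing = TRUE), ]
  scores <- head(scores, n)
  data.frame(name = name, score = scores$score, nat = scores$nat,
             stringsAsFactors = FALSE)
}

names_to_score <- c("Xiaoao", "Yukata")
# In real code, response_text would be content(result, "text") for that name.
result <- do.call(rbind, lapply(names_to_score, top_scores,
                                response_text = fake_response))
result
```

Because every iteration returns a data frame with the same named columns, the pieces can be bound by row (with rbind here, or with map_dfr) without the "must have names" error.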
I want to do a network analysis of the tweets of some users I am interested in and of the users mentioned in their tweets.
I retrieved the tweets (no retweets) from several user timelines using the rtweet package in r and want to see who they mention in their tweets.
There is even a variable with the screen names of the mentioned users, which will serve as the target group for my edge list. But sometimes a tweet mentions several users, and then the observation looks like this: c('luigidimaio', 'giuseppeconteit'). When only one user is mentioned, the observation is just that user's name (e.g. agorarai). I want to split the observations that contain several mentioned users into one observation per user, so that an observation holding two mentioned users as a vector becomes two observations, each containing one of them.
The code looks like this so far:
# get user timelines of the most active italian parties (excluding retweets)
tmls_nort <- get_timelines(c("Mov5Stelle", "pdnetwork", "LegaSalvini"),
n = 3200, include_rts = FALSE
)
# create an edge list
tmls_el = as.data.frame(cbind(Source = tolower(tmls_nort$screen_name), Target = tolower(tmls_nort$mentions_screen_name)))
Here is an extract of my dataframe:
Source Target n
<fct> <fct> <int>
1 legasalvini circomassimo 2
2 legasalvini 1giornodapecora 2
3 legasalvini 24mattino 2
4 legasalvini agorarai 28
5 legasalvini ariachetira 2
6 legasalvini "c(\"raiportaaporta\", \"brunovespa\")" 7
We can start from this: first you could clean up your columns, tidy up the data and plot your network.
The data I used are:
tmls_el
Source Target n
1 legasalvini circomassimo 2
2 legasalvini 1giornodapecora 2
3 legasalvini 24mattino 2
4 legasalvini agorarai 28
5 legasalvini ariachetira 26
6 legasalvini c("raiportaaporta", "brunovespa") 7
7 movimento5stelle c("test1", "test2", "test3", "test4", "test5", "test6", "test7", "test8") 20
Here is what I've done:
# here you replace the useless characters with nothing
tmls_el$Target <- gsub("c\\(\"", "", tmls_el$Target)
tmls_el$Target <- gsub("\\)", "", tmls_el$Target)
tmls_el$Target <- gsub("\"", "", tmls_el$Target)
library(stringr)
temp <- data.frame(str_split_fixed(tmls_el$Target, ", ", 8))
tmls_el_2 <- data.frame(
Source = c(rep(as.character(tmls_el$Source),8))
, Target = c(as.character(temp$X1),as.character(temp$X2),as.character(temp$X3),
as.character(temp$X4),as.character(temp$X5),as.character(temp$X6),
as.character(temp$X7),as.character(temp$X8))
, n = rep(tmls_el$n, 8)) # keep n numeric so it can be used as an edge width
Note: this works with the example you gave. If a row can have more than 8 targets, raise the 8 in str_split_fixed to k, add the extra columns (X9, ..., Xk) to Target, and repeat Source and n k times. Surely there is a more elegant way, but this works.
Here you can create edges and nodes:
library(dplyr)
el <- tmls_el_2 %>% filter(Target !='')
no <- data.frame(name = unique(c(as.character(el$Source),as.character(el$Target))))
Now you can use igraph to plot the results:
library(igraph)
g <- graph_from_data_frame(el, directed=TRUE, vertices=no)
plot(g, edge.width = el$n/2)
With data:
tmls_el <- data.frame(Source = c("legasalvini","legasalvini","legasalvini","legasalvini","legasalvini","legasalvini","movimento5stelle"),
Target = c("circomassimo","1giornodapecora","24mattino","agorarai","ariachetira","c(\"raiportaaporta\", \"brunovespa\")","c(\"test1\", \"test2\", \"test3\", \"test4\", \"test5\", \"test6\", \"test7\", \"test8\")"),
n = c(2,2,2,28,26,7,20))
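One possible way to avoid hard-coding the number of targets is to split each cleaned string and repeat Source and n by the number of pieces. This is only a sketch in base R on a shortened copy of the data above; the names small_el, cleaned, parts and el_long are made up for the example.

```r
# Shortened copy of the example data, with Target kept as character.
small_el <- data.frame(
  Source = c("legasalvini", "legasalvini", "movimento5stelle"),
  Target = c("agorarai",
             "c(\"raiportaaporta\", \"brunovespa\")",
             "c(\"test1\", \"test2\", \"test3\")"),
  n = c(28, 7, 20),
  stringsAsFactors = FALSE
)

# Strip the c( ... ) wrapper and the quotes, then split on commas.
cleaned <- gsub('c\\(|\\)|"', "", small_el$Target)
parts <- strsplit(cleaned, ",\\s*")

# One row per mentioned user; Source and n are repeated as often as needed.
el_long <- data.frame(
  Source = rep(small_el$Source, lengths(parts)),
  Target = unlist(parts),
  n = rep(small_el$n, lengths(parts)),
  stringsAsFactors = FALSE
)
el_long
```

Since lengths(parts) is computed per row, the same code handles one target or many without padding empty columns.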
I'm new to R, so please go easy on me... I have some longitudinal data that looks like
Basically, I'm trying to find a way to get a table with a) the number of unique cases that have all complete data and b) the number of unique cases that have at least one incomplete or missing data. The end results would ideally be
df<- df %>% group_by(Location)
df1<- df %>% group_by(any(Completion_status=='Incomplete' | 'Missing'))
I'm not sure exactly what you want, because your request and the desired output seem inconsistent, but let's try. It looks like you need a kind of frequency table, which you can build with base R. At the bottom of the answer you can find some data similar to yours.
# You have two cases, the Complete, and the other, so here a new column about it:
data$case <- ifelse(data$Completion_status =='Complete','Complete', 'MorIn')
# now a frequency table about them: if you want a data.frame, here we go
result <- as.data.frame.matrix(table(data$Location,data$case))
# now the location as a new column rather than the rownames
result$Location <- rownames(result)
# and lastly a data.frame with the final results: note that you can change the names
# of the columns but if you want spaces maybe a tibble is better
result <- data.frame(Location = result$Location,
`Number.complete` = result$Complete,
`Number.incomplete.missing` = result$MorIn)
result
Location Number.complete Number.incomplete.missing
1 London 0 1
2 Los Angeles 0 1
3 Paris 3 1
4 Phoenix 0 2
5 Toronto 1 1
Or if you prefer a dplyr chain:
data %>%
mutate(case = ifelse(data$Completion_status =='Complete','Complete', 'MorIn')) %>%
do( as.data.frame.matrix(table(.$Location,.$case))) %>%
mutate(Location = rownames(.)) %>%
select(3,1,2) %>%
`colnames<-`(c("Location","Number of complete ", "Number of incomplete or"))
Location Number of complete Number of incomplete or
1 London 0 1
2 Los Angeles 0 1
3 Paris 3 1
4 Phoenix 0 2
5 Toronto 1 1
With data:
# here is your data (next time try to put it in a usable form in the question)
data <- data.frame( ID = c("A1","A1","A2","A2","B1","C1","C2","D1","D2","E1"),
Location = c('Paris','Paris','Paris','Paris','London','Toronto','Toronto','Phoenix','Phoenix','Los Angeles'),
Completion_status = c('Complete','Complete','Incomplete','Complete','Incomplete','Missing',
'Complete','Incomplete','Incomplete','Missing'))
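The tables above count rows. If "unique cases" in the question means unique IDs, which is an assumption on my part, the status can first be collapsed to one value per ID and then tabulated. A rough base R sketch, using the same example data (the names all_complete, loc, status and tab are made up):

```r
data <- data.frame(
  ID = c("A1","A1","A2","A2","B1","C1","C2","D1","D2","E1"),
  Location = c("Paris","Paris","Paris","Paris","London","Toronto","Toronto",
               "Phoenix","Phoenix","Los Angeles"),
  Completion_status = c("Complete","Complete","Incomplete","Complete",
                        "Incomplete","Missing","Complete","Incomplete",
                        "Incomplete","Missing"),
  stringsAsFactors = FALSE
)

# TRUE for an ID only if every one of its records is Complete.
all_complete <- tapply(data$Completion_status == "Complete", data$ID, all)
# Location of each ID (assumes an ID never changes location).
loc <- tapply(data$Location, data$ID, function(x) x[1])

status <- ifelse(as.vector(all_complete), "Complete", "Incomplete or missing")
tab <- table(Location = as.vector(loc), Status = status)
tab
```

With this data Paris has one fully complete ID (A1) and one ID with an incomplete record (A2), rather than the 3 and 1 rows counted above.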
I'm new to R so please forgive the repetitive question. I was trying to do this in Access (I know) but unfortunately the application kept crashing.
I have a dataframe object that contains 78k records that I imported from a CSV, and it should form a tree like structure, while there may not be a natural root however as this is a subset of the entire org.
POS_NUM|TITLE|REPORT_TO_POS_NUM
1234 Bob 789
5698 Jim 1234
8976 Frank 1653
This should form a loose tree relationship
Bob
\ Jim
Frank
Essentially I need this to calculate the number of sub reports for each person, the number direct reports, as well as some other recursive functions
EDIT
Right now I'm attempting to simply loop through my table
treeDataOne <- read.csv(file="File1.csv", header=TRUE, stringsAsFactors=FALSE, sep=",")
treeDataTwo <- read.csv(file="File2.csv",header=TRUE, stringsAsFactors=FALSE, sep=",") #Same columns, different data
treeDataAll <- rbind(treeDataOne, treeDataTwo) #Merge data, this seems to work
#Adding new columns to store data
treeDataAll['DIRECT_REPORTS'] <- 0
treeDataAll['INDIRECT_REPORTS'] <- 0
treeDataAll['DIVISION'] <- ""
treeDataAll['BRANCH'] <- ""
treeDataAll['PROCESSED'] <- FALSE
I'm now trying to iterate over every record and calculate the direct reports
So in pseudocode it should be:
for i in treeDataAll{
i.DIRECT_REPORTS = nrow(where REPORT_TO_POS_NUM = i.pos_num)
}
library(data.table)
setDT(treeDataAll)
funky <- function(x){
nrow(treeDataAll[REPORT_TO_POS_NUM == x])
}
treeDataAll[, DIR_REPORTS := funky(POS_NUM), by = POS_NUM]
treeDataAll[]
# POS_NUM TITLE REPORT_TO_POS_NUM DIR_REPORTS
# 1: 1234 Bob 789 1
# 2: 5698 Jim 1234 0
# 3: 8976 Frank 1653 0
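The question also asks for the total number of sub-reports. As a sketch only, one way to get both counts in base R is to look up each person's manager row once and then walk up the chain. The fourth row (Ann) is a hypothetical addition so that the example has an indirect report, and the names org, boss_row and total are made up.

```r
org <- data.frame(
  POS_NUM = c(1234, 5698, 8976, 4321),
  TITLE = c("Bob", "Jim", "Frank", "Ann"),   # Ann is a hypothetical extra row
  REPORT_TO_POS_NUM = c(789, 1234, 1653, 5698),
  stringsAsFactors = FALSE
)

# Row index of each person's manager; NA when the manager is not in the data.
boss_row <- match(org$REPORT_TO_POS_NUM, org$POS_NUM)

# Direct reports: how often each POS_NUM appears as somebody's manager.
org$DIRECT_REPORTS <- as.vector(
  table(factor(org$REPORT_TO_POS_NUM, levels = org$POS_NUM))
)

# All reports: every person adds 1 to each manager above them in the chain.
total <- integer(nrow(org))
for (i in seq_len(nrow(org))) {
  boss <- boss_row[i]
  steps <- 0
  # The step limit guards against accidental cycles in the data.
  while (!is.na(boss) && steps < nrow(org)) {
    total[boss] <- total[boss] + 1L
    boss <- boss_row[boss]
    steps <- steps + 1
  }
}
org$INDIRECT_REPORTS <- total - org$DIRECT_REPORTS
org
```

Here Bob ends up with one direct report (Jim) and one indirect report (Ann); people whose manager is missing from the subset are simply treated as roots.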
I'm still learning R and have been given the task of grouping a long list of students into groups of four based on another variable. I have loaded the data into R as a data frame. How do I sample entire rows without replacement, one from each of 4 levels of a variable and have R output the data into a spreadsheet?
So far I have been tinkering with a for loop and the sample function but I'm quickly getting over my head. Any suggestions? Here is sample of what I'm attempting to do. Given:
Last.Name <- c("Picard","Troi","Riker","La Forge", "Yar", "Crusher", "Crusher", "Data")
First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data")
Email <- c("a#a.com","b#b.com", "c#c.com", "d#d.com", "e#e.com", "f#f.com", "g#g.com", "h#h.com")
Section <- c(1,1,2,2,3,3,4,4)
df <- data.frame(Last.Name,First.Name,Email,Section)
I want to randomly select a Star Trek character from each section and end up with 2 groups of 4. I would want the entire row's worth of information to make it over to a new data frame containing all groups with their corresponding group number.
I'd use the wonderful package 'dplyr'
require(dplyr)
random_4 <- df %>% group_by(Section) %>% slice(sample(c(1,2),1))
random_4
Source: local data frame [4 x 4]
Groups: Section
Last.Name First.Name Email Section
1 Troi Deanna b#b.com 1
2 La Forge Geordi d#d.com 2
3 Crusher Beverly f#f.com 3
4 Data Data h#h.com 4
random_4
Source: local data frame [4 x 4]
Groups: Section
Last.Name First.Name Email Section
1 Picard Jean-Luc a#a.com 1
2 Riker William c#c.com 2
3 Crusher Beverly f#f.com 3
4 Data Data h#h.com 4
%>% means 'and then'
The code is read as:
Take DF AND THEN for all 'Section', select by position (slice) 1 or 2. Voila.
I suppose you have 8 students: First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data").
If you wish to randomly assign a section number to the 8 students, and assuming you would like each section to have 2 students, then you can either permute Section <- c(1, 1, 2, 2, 3, 3, 4, 4) or permute the list of the students.
First approach, permute the sections:
> assigned_section <- print(sample(Section))
[1] 1 4 3 2 2 3 4 1
Then the following data frame gives the assignments:
assigned_students <- data.frame(First.Name, assigned_section)
Second approach, permute the students:
> assigned_students <- print(sample(First.Name))
[1] "Data" "Geordi" "Tasha" "William" "Deanna" "Beverly" "Jean-Luc" "Wesley"
Then, the following data frame gives the assignments:
assigned_students <- data.frame(assigned_students, Section)
Alex, Thank You. Your answer wasn't exactly what I was looking for, but it inspired the correct one for me. I had been thinking about the process from a far too complicated point of view. Instead of having R select rows and put them into a new data frame, I decided to have R assign a random number to each of the students and then sort the data frame by the number:
First, I broke up the data frame into sections:
df1<- subset(df, Section ==1)
df2<- subset(df, Section ==2)
df3<- subset(df, Section ==3)
df4<- subset(df, Section ==4)
Then I randomly generated a group number 1 through 4.
Groupnumber <-sample(1:4,4, replace=F)
Next, I told R to bind the columns:
Assigned1 <- cbind(df1,Groupnumber)
*Ran the group number generator and cbind in alternating order until I got through the whole set. (Wanted to make sure the order of the numbers was unique for each section).
Finally row binding the data set back together:
Final_List<-rbind(Assigned1,Assigned2,Assigned3,Assigned4)
Thank you everyone who looked this over. I am new to data science, R, and stackoverflow, but as I learn more I hope to return the favor.
I'd suggest the randomizr package to "block assign" according to section. The block_ra function lets you do this in an easy-to-read one-liner.
install.packages("randomizr")
library(randomizr)
df$group <- block_ra(block_var = df$Section,
condition_names = c("group_1", "group_2"))
You can inspect the resulting sets in a variety of ways. Here's with base r subsetting:
df[df$group == "group_1",]
Last.Name First.Name Email Section group
2 Troi Deanna b#b.com 1 group_1
3 Riker William c#c.com 2 group_1
6 Crusher Beverly f#f.com 3 group_1
7 Crusher Wesley g#g.com 4 group_1
df[df$group == "group_2",]
Last.Name First.Name Email Section group
1 Picard Jean-Luc a#a.com 1 group_2
4 La Forge Geordi d#d.com 2 group_2
5 Yar Tasha e#e.com 3 group_2
8 Data Data h#h.com 4 group_2
If you want to roll your own:
set <- tapply(1:nrow(df), df$Section, FUN = sample, size = 1)
df[set,] # show the sampled set
df[-set,] # show the complementary set
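The same idea can be stretched to any number of groups and written out as a file, which the question also asked for. The sketch below assumes every section has the same number of students; the Group column and the output file name are hypothetical choices.

```r
Last.Name <- c("Picard","Troi","Riker","La Forge","Yar","Crusher","Crusher","Data")
First.Name <- c("Jean-Luc","Deanna","William","Geordi","Tasha","Beverly","Wesley","Data")
Section <- c(1,1,2,2,3,3,4,4)
df <- data.frame(Last.Name, First.Name, Section, stringsAsFactors = FALSE)

set.seed(1)  # only to make the example repeatable
# Within each section, hand out the group numbers 1..k in random order,
# so every group receives exactly one student from each section.
df$Group <- ave(seq_len(nrow(df)), df$Section,
                FUN = function(i) sample(length(i)))

grouped <- df[order(df$Group, df$Section), ]
grouped

# Write the result to a CSV file that a spreadsheet program can open.
out_file <- file.path(tempdir(), "groups.csv")  # hypothetical file name
write.csv(grouped, out_file, row.names = FALSE)
```

Each section is shuffled independently, so no student is drawn twice and no manual splitting into df1 to df4 is needed.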
Let's say I have:
Person Movie Rating
Sally Titanic 4
Bill Titanic 4
Rob Titanic 4
Sue Cars 8
Alex Cars **9**
Bob Cars 8
As you can see, there is a contradiction for Alex. All rows for the same movie should have the same rating, but there was a data entry error for Alex. How can I use R to solve this? I've been thinking about it for a while, but I can't figure it out. Do I have to just do it manually in Excel or something? Is there a command in R that will return all the cases where there are data contradictions between two columns?
Perhaps I could have R do a boolean check if all the Movie cases match the first rating of its first iteration? For all that returns "no," I can go look at it manually? How would I write this function?
Thanks
Here's a data.table solution
Define the function
Myfunc <- function(x) {
temp <- table(x)
names(temp)[which.max(temp)]
}
library(data.table)
Create a column with the correct rating (by reference)
setDT(df)[, CorrectRating := Myfunc(Rating), Movie][]
# Person Movie Rating CorrectRating
# 1: Sally Titanic 4 4
# 2: Bill Titanic 4 4
# 3: Rob Titanic 4 4
# 4: Sue Cars 8 8
# 5: Alex Cars 9 8
# 6: Bob Cars 8 8
Or If you want to remove the "bad" ratings
df[Rating == CorrectRating][]
# Person Movie Rating CorrectRating
# 1: Sally Titanic 4 4
# 2: Bill Titanic 4 4
# 3: Rob Titanic 4 4
# 4: Sue Cars 8 8
# 5: Bob Cars 8 8
It looks like, within each group defined by "Movie", you're looking for any instances of Rating that are not the same as the most common value.
You can solve this using dplyr (which is good at "group by one column, then perform an operation within each group"), along with the "Mode" function defined in this answer that finds the most common item in a vector:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr)
dat %>% group_by(Movie) %>% filter(Rating != Mode(Rating))
This finds all the cases where a row does not agree with the rest of the group. If you instead want to remove them, you can do:
newdat <- dat %>% group_by(Movie) %>% filter(Rating == Mode(Rating))
If you want to fix them, do
newdat <- dat %>% group_by(Movie) %>% mutate(Rating = Mode(Rating))
You can test the above with a reproducible version of your data:
dat <- data.frame(Person = c("Sally", "Bill", "Rob", "Sue", "Alex", "Bob"),
Movie = rep(c("Titanic", "Cars"), each = 3),
Rating = c(4, 4, 4, 8, 9, 8))
If the goal is to see if all the values within a group are the same (or if there are some differences) then this can be a simple application of tapply (or aggregate, etc.) used with a function like var (or compute the range). If all the values are the same then the variance and range will be 0. If it is any other value (outside of rounding error) then there must be a value that is different. The which function can help identify the group/individual.
tapply(dat$Rating, dat$Movie, FUN=var)
which(.Last.value > 0.00001)
tapply(dat$Rating, dat$Movie, FUN=function(x)diff(range(x)))
which(.Last.value != 0)
which( abs(dat$Rating - ave(dat$Rating, dat$Movie)) > 0)
which.max( abs(dat$Rating - ave(dat$Rating, dat$Movie)) )
dat[.Last.value,]
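A variation on the same check, written so that it does not rely on .Last.value, is to count the distinct ratings per movie and pull out every row of the movies that have more than one. This is just a sketch on the example data; n_ratings and conflicted are made-up names.

```r
dat <- data.frame(Person = c("Sally", "Bill", "Rob", "Sue", "Alex", "Bob"),
                  Movie = rep(c("Titanic", "Cars"), each = 3),
                  Rating = c(4, 4, 4, 8, 9, 8),
                  stringsAsFactors = FALSE)

# Number of distinct ratings recorded for each movie.
n_ratings <- tapply(dat$Rating, dat$Movie, function(r) length(unique(r)))

# Movies with more than one distinct rating contain a contradiction.
conflicted <- names(n_ratings)[n_ratings > 1]
conflicted

# All rows of those movies, for manual inspection.
dat[dat$Movie %in% conflicted, ]
```

This only flags the movie; deciding which of its ratings is the wrong one still needs a rule such as the mode-based approaches in the other answers.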
I would add a variable for the mode so I can see whether anything weird is going on with the data, like missing data, text, or many different answers instead of the rare anomaly. I used "x" as your dataset.
# one of many functions to find mode, could use any other
modefunc <- function(x){
names(table(x))[table(x)==max(table(x))]
}
# add variable for mode split by Movie
x$mode <- ave(x = x$Rating,x$Movie,FUN = modefunc)
# do whatever you want with the records that are different
x[x$Rating != x$mode, ]
If you want a different function for the mode, any other implementation can be substituted for modefunc.