I have a dataset in which some columns hold a code plus a name, and I would like to split each of them into two columns. For example:
Column E5000_A contains values like `0080002. ALB - Democratic Party` in one cell. I would like two columns, one containing the code 0080002 and the other containing the rest of the information.
There are eight such columns in total (E5000_A through E5000_H). This is the code I am writing:
cols2 <- c("E5000_A" , "E5000_B" , "E5000_C" , "E5000_D" ,
"E5000_E" , "E5000_F" , "E5000_G" , "E5000_H" )
for(i in cols2){
cses_imd_m <- cses_imd_m %>% mutate(substr(i, 1L, 7L))
}
But for some reason it only generates a new column for E5000_A, and the loop does not move on to the other variables. What am I doing wrong? Let me know if you need more details about the code or data frame.
data.frame approach
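library(dplyr)
library(stringr)
# to extract codes (this overwrites each column with its leading digits)
df %>%
  mutate_at(.vars = vars(c("E5000_A", "E5000_B", "E5000_C", "E5000_D", "E5000_E",
                           "E5000_F", "E5000_G", "E5000_H")),
            .funs = function(x) str_extract(x, "^\\d+"))  # str_extract(string, pattern)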
You can also use across() inside of mutate().
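A minimal sketch of the across() variant, assuming dplyr 1.0 or later and stringr are installed. The two-column data frame is a made-up stand-in for the real data (only the first value comes from the question), and the `code_` and `party_` prefixes are names chosen here for illustration:

```r
library(dplyr)
library(stringr)

# Made-up sample: only the first value is taken from the question.
df <- data.frame(
  E5000_A = c("0080002. ALB - Democratic Party", "0000001. XXX - Placeholder Party"),
  E5000_B = c("0000002. YYY - Another Placeholder", NA),
  stringsAsFactors = FALSE
)
cols <- c("E5000_A", "E5000_B")

result <- df %>%
  mutate(
    # leading digits become code_<column>
    across(all_of(cols), ~ str_extract(.x, "^\\d+"), .names = "code_{.col}"),
    # everything after "<digits>." becomes party_<column>
    across(all_of(cols), ~ str_trim(str_remove(.x, "^\\d+\\.")), .names = "party_{.col}")
  )
```

Because `.names` is set, the original columns are kept and the new ones are added next to them, instead of being overwritten as with mutate_at.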
If you want to use a for loop:
col_names <- c("E5000_A", "E5000_B", "E5000_C", "E5000_D", "E5000_E", "E5000_F", "E5000_G", "E5000_H")
for (i in col_names) {
  # df[[i]] returns the column as a vector for both data frames and tibbles
  df[[sprintf("code_%s", i)]] <- str_extract(df[[i]], "^\\d+")
  # remove everything up to the dot (.), then trim whitespace
  df[[sprintf("party_%s", i)]] <- str_trim(gsub(".*\\.", "", df[[i]]))
}
I'm trying to process a survey, in which one of the questions asks the respondents to name a friend. Now I have a matrix like this:
I want to save these results in a relational database. I have assigned every person a unique ID, and want the answers to be saved as a list of IDs, so that the table looks like this:
My code so far:
I've tried
df$name %in% df$friends
which did not give any results. I'm now trying to use a for loop with str_detect:
friends <- df$friends
names <- df$name
for (i in 1:length(names)) {
friends_called <- str_detect(friends, names[i])
id_index <- grep(names[i], df$name)
id <- df$id[id_index]
for (j in 1:length(friends_called)) {
if(friends_called[j] == T) {
df$friends_id[j] <- paste(df$friends_id[j], id, ",", sep="")
}
}
}
df$friends <- df$friends_id
But I have some issues with it:
It's not working
It uses two loops, which I'm used to from writing Python, but I read that I should avoid them in R
The string matching needs to be fuzzy (if Anna wrote "Jon" instead of "John", it should still match).
Does anyone have suggestions on how to tackle this?
You can do this without a loop in tidyverse as follows:
df %>%
mutate(friends = map(friends, ~ df %>%
filter(str_detect(.x,name)) %>%
select(id) %>%
unlist() %>%
paste(collapse = ',')))
gives
id name friends
1 a1d John b2e,c3f
2 b2e Anna a1d
3 c3f Denise
or with base R you can use sapply:
df$friends <- sapply(df$friends, function(x) paste(df$id[str_detect(x, df$name)], collapse = ','), USE.NAMES = FALSE)
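Neither version above handles the fuzzy-matching requirement from the question. One possible approach, sketched here in base R, is to split each answer on commas and accept a name when its edit distance (via `utils::adist`) is at most one. The sample data frame is an assumption reconstructed from the printed output above, the comma separator is assumed, and `match_friend_ids` and `max_edits` are hypothetical names:

```r
# Assumed sample data: ids and names follow the output shown above,
# the friends column (including the misspelled "Jon") is made up.
df <- data.frame(
  id      = c("a1d", "b2e", "c3f"),
  name    = c("John", "Anna", "Denise"),
  friends = c("Anna, Denise", "Jon", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the ids of all known names that are within
# max_edits edits of any comma-separated token in friends_string.
match_friend_ids <- function(friends_string, known_names, ids, max_edits = 1) {
  tokens <- trimws(strsplit(friends_string, ",")[[1]])
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0) return("")
  # rows are tokens, columns are known names
  d <- utils::adist(tokens, known_names, ignore.case = TRUE)
  hit <- apply(d, 2, min) <= max_edits
  paste(ids[hit], collapse = ",")
}

df$friends_id <- vapply(df$friends, match_friend_ids, character(1),
                        known_names = df$name, ids = df$id, USE.NAMES = FALSE)
```

A threshold of one edit is only a starting point. Short names that differ by a single letter would be confused with each other, so the cutoff probably needs tuning against the real answers.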
I am a beginner in R, and while trying to do some exercises I got stuck on one of them. My data.frame is as follows:
LanguageWorkedNow LanguageNextYear
Java; PHP Java; C++; SQL
C;C++;JavaScript; JavaScript; C; SQL
I need to find the languages that are in LanguageNextYear but not in LanguageWorkedNow, and collect the differing ones in a list.
Sorry if the question is duplicated, I'm quite new here and tried to find it, but with no success.
base R
Idea: mapply setdiff on strsplitted NextYear and WorkedNow, and then paste it using collapse=";":
df$New <- with(df, {
a <- mapply(setdiff, strsplit(NextYear, ";"), strsplit(WorkedNow, ";"), SIMPLIFY = FALSE)
sapply(a, paste, collapse=";")
})
# SIMPLIFY = FALSE is needed in a general case, it doesn't
# affect the output in the example case
# Or if you use Map instead of mapply, that is the default, so
# it could also be...
df$New <- with(df,
sapply(Map(setdiff, strsplit(NextYear, ";"), strsplit(WorkedNow, ";")),
paste, collapse=";"))
data
df <- read.table(text = "WorkedNow NextYear
Java;PHP Java;C++;SQL
C;C++;JavaScript JavaScript;C;SQL
", header=TRUE, stringsAsFactors=FALSE)
Here's a solution using the purrr package:
df = read.table(text = "
LanguageWorkedNow LanguageNextYear
Java;PHP Java;C++;SQL
C;C++;JavaScript JavaScript;C;SQL
", header=T, stringsAsFactors=F)
library(purrr)
df$New = map2_chr(df$LanguageWorkedNow,
df$LanguageNextYear,
~{x1 = unlist(strsplit(.x, split=";"))
x2 = unlist(strsplit(.y, split=";"))
paste0(x2[!x2%in%x1], collapse = ";")})
df
# LanguageWorkedNow LanguageNextYear New
# 1 Java;PHP Java;C++;SQL C++;SQL
# 2 C;C++;JavaScript JavaScript;C;SQL SQL
For each row you get your columns and you create vectors of values (separated by ;). Then you check which values of NextYear vector don't exist in WorkedNow vector and you create a string based on / combining those values.
The map function family will help you apply your logic / function to each row. In our case we use map2_chr, as we have two inputs (your two columns) and we expect a string / character output.
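Both answers read the data without spaces, while the sample rows in the question show spaces after some semicolons and a trailing semicolon. A base R sketch that tolerates both, assuming the raw strings look like the question's sample (`split_langs` is a hypothetical helper name):

```r
# Sample rows copied from the question, including spaces and a trailing ";"
df <- data.frame(
  LanguageWorkedNow = c("Java; PHP", "C;C++;JavaScript;"),
  LanguageNextYear  = c("Java; C++; SQL", "JavaScript; C; SQL"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split on ";", trim whitespace, drop empty pieces
split_langs <- function(x) {
  lapply(strsplit(x, ";"), function(parts) {
    parts <- trimws(parts)
    parts[nzchar(parts)]
  })
}

# Languages planned for next year that are not in use now, per row
df$New <- mapply(
  function(next_year, now) paste(setdiff(next_year, now), collapse = ";"),
  split_langs(df$LanguageNextYear),
  split_langs(df$LanguageWorkedNow),
  USE.NAMES = FALSE
)
```

Without the trimming step, " C++" and "C++" would count as different languages and the set difference would be wrong.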
I am a linguistics student doing experiments in R. I have been looking at other questions and got a lot of help, but I am stuck at the moment because I cannot adapt the example functions to my case, and I would love to have some help.
First, I would like to go through every semester from here: http://registration.boun.edu.tr/schedule.htm, and every department here: http://registration.boun.edu.tr/scripts/schdepsel.asp
It is actually a bit easy to generate the list of it as the final link is something like this: http://registration.boun.edu.tr/scripts/sch.asp?donem=2017/2018-3&kisaadi=ATA&bolum=ATATURK+INSTITUTE+FOR+MODERN+TURKISH+HISTORY
Secondly, I need to select the code, name, days and hours of the course and tag the semester, which I did. (probably, I did it extremely poorly, but I did it nevertheless, yay!)
library("rvest")
library("dplyr")
library("magrittr")
# define the html
reg <- read_html("http://registration.boun.edu.tr/scripts/sch.asp?donem=2017/2018-3&kisaadi=ATA&bolum=ATATURK+INSTITUTE+FOR+MODERN+TURKISH+HISTORY")
# make the html a list of tables
regtable <- reg %>% html_table(fill = TRUE)
# tag their year
regtable[[4]][ ,15] <- regtable[[1]][1,2]
regtable[[4]][1,15] <- "Semester"
# Change the Days and Hours to sth usable, but how and to what?
# parse the dates, T and Th problem?
# parse the hour 10th hour problem?
# get the necessary info
regtable <- regtable %>% .[4] %>% as.data.frame() %>% select( . , X1 , X3 , X8 , X9 , V15)
# correct the names
names(regtable) <- regtable[1,]
regtable <- regtable[-1,]
View(regtable)
But the problem is that I want to write a function that can do this for more than 20 semesters and more than 50 departments. Any help would be great! I am doing this so that I can work on optimizing class hours for my department.
I guess I can do this better with XML Package, but I could not understand how to use it.
Thanks for any help,
Utku
Here is an answer building upon what you have already done. There are likely more efficient solutions, but this should be a good start. You also don't state how you would like to store the data, so currently what I have made will assign each combination of semester and department to its own data frame, which creates a huge amount for the number of departments. It is not ideal but I don't know how you plan to use the data after collection.
library("rvest")
library("dplyr")
library("magrittr")
# Create a Department list
dep_list <- read_html("http://registration.boun.edu.tr/scripts/schdepsel.asp")
# Take the read html and identify all objects of class menu2 and extract the
# href which will give you the final part of the url
dep_list <- dep_list %>%
html_nodes(xpath = '//*[@class="menu2"]') %>%
html_attr("href")
department_list <- gsub("/scripts/sch.asp?donem=", "", dep_list, fixed = TRUE)
# Create a list for all of the semesters
sem_list <- read_html("http://registration.boun.edu.tr/schedule.htm")
sem_list <- sem_list %>% html_table(fill = TRUE)
# Extract the table from the list needed
semester_df <- sem_list[[2]]
# The website uses a table for the dropdown but the values are all in the second cell
# of the second column as a string
semester_list <- semester_df$X2[2]
# Separate the string into a list at the space characters
semester_list <- unlist(strsplit(semester_list, "\\s+"))
# Loop through the list of departments and within each department loop through the
# list of semesters to get the data you want
for(dep in department_list){
for(sem in semester_list){
url <- paste("http://registration.boun.edu.tr/scripts/sch.asp?donem=", sem, dep, sep = "")
reg <- read_html(url)
# make the html a list of tables
regtable <- reg %>% html_table(fill = TRUE)
# The data we want is in the 4th portion of the created list so extract that
regtable <- regtable[[4]]
# Rename the column headers to the values in the first row and remove the
# first row
regtable <- setNames(regtable[-1, ], regtable[1, ])
# Create semester column and select the variables we want
regtable <- regtable %>%
mutate(Semester = sem) %>%
select(Code.Sec, Name, Days, Hours, Semester)
# Assign the created table to a dataframe
# Could also save the file here instead
assign(paste("table", sem, gsub(" ", "_", dep), sep = "_"), regtable)
}
}
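As noted above, one data frame per semester and department is not ideal. A possible alternative, sketched here, is to collect the tables in a named list and bind them once at the end. `scrape_one` is a hypothetical stand-in for the scraping steps inside the loop (it only fabricates a one-row table so the sketch runs offline), and the semester and department values are sample placeholders:

```r
# Hypothetical stand-in for the read_html / html_table steps in the loop above.
scrape_one <- function(sem, dep) {
  data.frame(Code.Sec = paste0(dep, " 101.01"), Semester = sem, Department = dep,
             stringsAsFactors = FALSE)
}

semester_list   <- c("2017/2018-1", "2017/2018-2")  # sample values
department_list <- c("ATA", "HIST")                 # sample values

# Every semester/department combination, one row each
combos <- expand.grid(sem = semester_list, dep = department_list,
                      stringsAsFactors = FALSE)

# One table per combination, kept together in a named list
tables <- Map(scrape_one, combos$sem, combos$dep)
names(tables) <- paste(combos$sem, combos$dep, sep = "_")

# Stack everything into a single data frame
all_schedules <- do.call(rbind, tables)
rownames(all_schedules) <- NULL
```

A single data frame with Semester and Department columns is usually easier to filter and summarise than hundreds of separately named objects created with assign.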
Thanks to @Amanda I was able to achieve what I wanted to do. The only thing left is to scrape the list of short names and match them to the departments, but for now I can do what I want by creating that list by hand. Any further comments on how to do this more elegantly are appreciated!
library("rvest")
library("dplyr")
library("magrittr")
# Create a Department list
dep_list <- read_html("http://registration.boun.edu.tr/scripts/schdepsel.asp")
dep_list <- dep_list %>% html_table(fill = TRUE)
# Select the table from the html that contains the data we want
department_df <- dep_list[[2]]
# Rename the columns with the value of the first row and remove row
department_df <- setNames(department_df[-1, ], department_df[1, ])
# Combine the two columns into a list
department_list <- c(department_df[, 1], department_df[, 2])
# Edit the department list
# We can choose accordingly.
department_list <- department_list[c(7,8,16,20,26,33,36,37,38,39)]
# Create a list for all of the semesters
sem_list <- read_html("http://registration.boun.edu.tr/schedule.htm")
sem_list <- sem_list %>% html_table(fill = TRUE)
# Extract the table from the list needed
semester_df <- sem_list[[2]]
# The website uses a table for the dropdown but the values are all in the second cell
# of the second column as a string
semester_list <- semester_df$X2[2]
# Separate the string into a list at the space characters
semester_list <- unlist(strsplit(semester_list, "\\s+"))
# Shortnames string
# We can add whichever we want.
shortname_list <- c("FLED", "HIST" , "PSY", "LL" , "PA" , "PHIL" , "YADYOK" , "SOC" , "TR" , "TKL" )
# Length
L = length(department_list)
# the function to get the schedule for the selected departments
for( i in 1:L){
for(sem in semester_list){tryCatch({
dep <- department_list[i]
sn <- shortname_list[i]
url_second_part <- paste0("&kisaadi=", sn, "&bolum=", gsub(" ", "+", gsub("&", "%26", dep)))
url <- paste("http://registration.boun.edu.tr/scripts/sch.asp?donem=", sem, url_second_part, sep = "")
reg <- read_html(url)
# make the html a list of tables
regtable <- reg %>% html_table(fill = TRUE)
# The data we want is in the 4th portion of the created list so extract that
regtable <- regtable[[4]]
# Rename the column headers to the values in the first row and remove the
# first row
regtable <- setNames(regtable[-1, ], regtable[1, ])
# Create semester column and select the variables we want
regtable <- regtable %>%
mutate(Semester = sem) %>%
select(Code.Sec, Name, Days, Hours, Semester)
# Assign the created table to a dataframe
# Could also save the file here instead
assign(paste("table", sem, gsub(" ", "_", dep), sep = "_"), regtable)
}, error = function(e){cat("ERROR : No information on this" , url , "\n" )})
}
}
### Maybe make Errors another dataset or list too.
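One way to follow up on that last comment is to store failures in their own list instead of only printing them. A minimal sketch, assuming a hypothetical `scrape_one` function that stands in for the scraping steps and fails for some inputs; the example.org URLs are placeholders:

```r
# Hypothetical stand-in that fails for one of the inputs.
scrape_one <- function(url) {
  if (grepl("missing", url)) stop("no schedule table found")
  data.frame(url = url, rows = 1L, stringsAsFactors = FALSE)
}

urls <- c("http://example.org/ok-1", "http://example.org/missing",
          "http://example.org/ok-2")

results <- list()
errors  <- list()
for (u in urls) {
  # Return the condition object instead of stopping the loop
  outcome <- tryCatch(scrape_one(u), error = function(e) e)
  if (inherits(outcome, "error")) {
    errors[[u]] <- conditionMessage(outcome)
  } else {
    results[[u]] <- outcome
  }
}

# Failures as a data frame, one row per failed URL
error_log <- data.frame(url = names(errors),
                        message = unlist(errors, use.names = FALSE),
                        stringsAsFactors = FALSE)
```

The error log can then be inspected or retried later, while the successful tables stay in `results`.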
I imported a tibble from a text file. Many numeric columns were imported as "chr". I guess it's because they contain a "," instead of a ".".
My goal is to write a loop which runs through the names of the desired columns, replaces "," with "." and converts the columns to "num".
Little example:
data <- data.frame("A1" =c("2,1","2,1","2,1"), "A2" =c("1,3","1,3","1,3"),
stringsAsFactors = F) %>% as.tibble() #example data
colname <- c("A1", "A2") #creating variable for loop
for(i in colname) {
nam <- paste0("data$", i)
assign(nam, as.numeric(gsub(",",".", eval(parse(text = paste0("data$",i))))) )
}
Instead of overwriting the existing column, R creates a new variable:
data$A1 # that's the existing column as part of the tibble
[1] "2,1" "2,1" "2,1"
`data$A1` # thats just a new variable. mind the little``
[1] 2.1 2.1 2.1
I also tried to assign (<-) the new numeric values via eval, but that does not work either.
eval(parse(text = paste0("data$", i))) <- as.numeric(
gsub(",",".", eval(parse(text = paste0("data$",i)))))
Error: target of assignment expands to non-language object
Any suggestions on how to transform? I have the same issue with other columns that I want to aggregate to a new variable. This variable should also be part of the existing tibble. I could do it by hand. This would take lots of time and probably produce many mistakes.
Thanks a lot!
Sam
As you are already working with the tidyverse, you can use dplyr::mutate_at and the colname variable you have already defined.
data %>%
mutate_at(.vars = colname,
.funs = function(x) { as.numeric(gsub(",", ".", x)) })
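The loop in the question can also be kept if the column is addressed with `[[` instead of building the text "data$A1". `assign("data$A1", ...)` creates a separate variable whose name is literally `data$A1`, which is why the tibble was left unchanged. A base R sketch, using a plain data frame with the question's sample values (the same indexing works on a tibble):

```r
data <- data.frame(A1 = c("2,1", "2,1", "2,1"),
                   A2 = c("1,3", "1,3", "1,3"),
                   stringsAsFactors = FALSE)
colname <- c("A1", "A2")

for (i in colname) {
  # data[[i]] reads and writes the column named by the string in i
  data[[i]] <- as.numeric(sub(",", ".", data[[i]], fixed = TRUE))
}
```

`sub` replaces only the first comma in each value, which is enough for a decimal separator. Values that also use a thousands separator would need extra handling.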
I have a script which generates multiple dataframes after scraping data from the internet:
library("rvest")
urllist <- c("https://en.wikipedia.org/wiki/Jawaharlal_Nehru",
"https://en.wikipedia.org/wiki/Indira_Gandhi")
for (i in seq_along(urllist)) {
  url <- urllist[i]
  print(url)
  mydata <- url %>%
    read_html() %>%  # html() was deprecated in favour of read_html()
    html_nodes(xpath = '//*[@id="mw-content-text"]/table[1]') %>%
    html_table()
  X <- mydata[[1]]
  assign(paste("df", i, sep = '_'), X)
}
So it creates df_1, df_2, etc.
After download, each of these dataframes has 2 columns. The 1st column's name is the person's name; the 2nd column's name is NA.
How can I dynamically rename the columns of all those dataframes, so that the 1st column is named "ID" and the 2nd is named after the person?
My attempt below fails. It changes those strings, but it does not affect my dataframes.
for(i in 1:length(urllist))
{ assign(colnames(get(paste("df", i, sep = '_')))[1],"ID")
assign(colnames(get(paste("df", i, sep = '_')))[2],colnames(get(paste("df", i, sep = '_')))[1])
}
My final goal is then to merge all those dataframes in a single dataframe based on column "ID".
What could be the way ?
Solved it this way:
for (i in (1:length(urllist)))
{
df.tmp <- get(paste("df", i, sep = '_'))
names(df.tmp) <- c("ID",colnames(get(paste("df", i, sep = '_')))[1] )
assign(paste("df",i,sep='_'), df.tmp)
}
For merging, I solved it this way:
#making the list without the 1st df
alldflist = lapply(setdiff(ls(pattern = "^df_[0-9]+$"), "df_1"), get)
#merge multiple data frames by ID
#note at first taking the 1st df
mergedf<-df_1
for ( .df in alldflist )
{
mergedf <-merge(mergedf,.df,by.x="ID", by.y="ID",all=T)
}
It works. But can anybody suggest a better way to handle these dynamic dataframe names and merge everything into a single dataframe?
Using a list, as Roman pointed out in his comment, would definitely work in this case. But since you're already looping through your URLs, why not rename the columns inside your initial for loop, before the call to assign? Something like this:
colnames(X) <- c("ID", colnames(X)[1])
This assumes you want the original first column name to become the second column name, which appears to be the case based on your second loop.
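The list-based route mentioned above could look roughly like the following base R sketch. The two small tables are made-up stand-ins for the scraped ones (placeholder values, and "unnamed" in place of the NA column name), and `make_table` is a hypothetical helper used only to build them:

```r
# Hypothetical helper that builds a stand-in for one scraped table:
# first column named after the person, second column without a useful name.
make_table <- function(person, keys, values) {
  tbl <- data.frame(keys, values, stringsAsFactors = FALSE)
  names(tbl) <- c(person, "unnamed")
  tbl
}

tables <- list(
  make_table("Jawaharlal Nehru", c("Born", "Died"),   c("value 1", "value 2")),
  make_table("Indira Gandhi",    c("Born", "Spouse"), c("value 3", "value 4"))
)

# Rename: first column becomes "ID", second takes the person's name
renamed <- lapply(tables, function(tbl) {
  names(tbl) <- c("ID", names(tbl)[1])
  tbl
})

# Full outer join of all tables on "ID"
merged <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), renamed)
```

Keeping the tables in a list avoids get and assign entirely, and Reduce handles any number of tables without a special case for the first one.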