strsplit by variable separator - r

I have some strings of data separated by " " that need to be split into columns. Is there an easy way to split the data by every nth separator? For example, the first value in x tells you that the first 4 values in y correspond to the first trial. The second value in x tells you that the next 3 values in y correspond to the second trial, and so on.
x <- c("4 3 3", "3 3 3 2 3")
y <- c("110 88 77 66 55 44 33 22 33 44 11 22 11", "44 55 66 33 22 11 22 33 44 55 66 77 88 66 77 88")
The goal is something like this:
structure(list(session = 1:2, trial.1 = structure(1:2, .Label = c("110 88 77",
"44 55 66"), class = "factor"), trial.2 = structure(c(2L, 1L), .Label = c("33 22 11",
"66 55 44"), class = "factor"), trial.3 = structure(1:2, .Label = c("22 33 44",
"23 33 44"), class = "factor"), trial.4 = structure(c(NA, 1L), .Label = "55 66", class = "factor"),
trial.5 = structure(c(NA, 1L), .Label = "77 88 66", class = "factor")), .Names = c("session",
"trial.1", "trial.2", "trial.3", "trial.4", "trial.5"), class = "data.frame", row.names = c(NA,
-2L))
Ideally, any extra values from y should be dropped from the resulting data frame, and the uneven row lengths should be padded with NAs.

This may be useful:
# split the count and value strings on spaces
dumx <- strsplit(x, ' ')
dumy <- strsplit(y, ' ')
# cumulative counts give the end position of each trial within y
dumx <- lapply(dumx, function(x) cumsum(as.numeric(x)))
# turn the end positions into start:end index sequences, one per trial
dumx <- lapply(dumx, function(x) mapply(seq, c(1, x + 1)[-(length(x) + 1)], x, SIMPLIFY = FALSE))
# pull out the values of y belonging to each trial of each session
ans <- mapply(function(x, y) lapply(x, function(w, z) z[w], z = y), dumx, dumy)
I will leave you to convert the resulting list to a data frame :)
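If it helps, here is one possible way to finish that last step (a sketch only; the trial.* column names and the NA padding follow the example in the question, and ans is the list produced above):
# collapse each trial back to a space-separated string
trials <- lapply(ans, function(s) sapply(s, paste, collapse = " "))
# pad shorter sessions with NA so every row has the same number of trials
n.max  <- max(lengths(trials))
padded <- lapply(trials, function(s) c(s, rep(NA, n.max - length(s))))
out <- data.frame(session = seq_along(padded), do.call(rbind, padded))
names(out)[-1] <- paste0("trial.", seq_len(n.max))
out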

Replacing column names with another data frame if matches

Hi, I am looking into how to match two data frames by column name and then rename the columns. If a column name has no match, I want to drop that column instead.
For example, I would use this main dataset, call it DF1:
Name    Reference  Good  Fair  Bad  Great  Poor
George  Hill       34    21    33   21     32
Frank   Stairs     29    28    29   30     29
Bertha  Trail      25    25    24   21     26
Then another DF, call this DF2, that allows me to replace the names of the columns of DF1
Name   Adjusted_Name
Good   good_run
Great  very_great_work
Bad    bad run
Fair   fair run decent
Essentially, the words to substitute follow no particular pattern. I want to match the first column of DF2 against the column names of DF1, and wherever DF2$Name matches a DF1 column name, replace that name with the corresponding DF2$Adjusted_Name. If there is no match, the column in DF1 is dropped.
So the final goal would be to achieve:
Name    Reference  good_run  fair run decent  Bad run  very_great_work
George  Hill       34        21               33       21
Frank   Stairs     29        28               29       30
Bertha  Trail      25        25               24       21
In this case, "Poor" was dropped because it had no match in DF2.
How should I go about this? How would I account for thousands of columns; does that change how I would write the code? I am a bit new to R and would appreciate any tips. Thank you!
If you are open to a tidyverse solution, you could use
library(dplyr)
library(tibble)

df %>%
  # deframe(df2) is a named lookup vector (Good -> good_run, ...)
  rename_with(~ deframe(df2)[.x], .cols = df2$Name) %>%
  # keep the id columns plus only the renamed value columns
  select(Name, Reference, any_of(df2$Adjusted_Name))
This returns
# A tibble: 3 x 6
Name Reference good_run very_great_work bad_run fair_run_decent
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 George Hill 34 21 33 21
2 Frank Stairs 29 30 29 28
3 Bertha Trail 25 21 24 25
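If it helps to see why the lookup works: deframe(df2) collapses the two-column tibble into a named character vector (names taken from Name, values from Adjusted_Name), so indexing it with the old column names returns the new ones. A quick check:
deframe(df2)
# a named vector: Good = "good_run", Great = "very_great_work", Bad = "bad_run", Fair = "fair_run_decent"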
Data
df <- structure(list(Name = c("George", "Frank", "Bertha"), Reference = c("Hill",
"Stairs", "Trail"), Good = c(34, 29, 25), Fair = c(21, 28, 25
), Bad = c(33, 29, 24), Great = c(21, 30, 21), Poor = c(32, 29,
26)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-3L), spec = structure(list(cols = list(Name = structure(list(), class = c("collector_character",
"collector")), Reference = structure(list(), class = c("collector_character",
"collector")), Good = structure(list(), class = c("collector_double",
"collector")), Fair = structure(list(), class = c("collector_double",
"collector")), Bad = structure(list(), class = c("collector_double",
"collector")), Great = structure(list(), class = c("collector_double",
"collector")), Poor = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
df2 <- structure(list(Name = c("Good", "Great", "Bad", "Fair"), Adjusted_Name = c("good_run",
"very_great_work", "bad_run", "fair_run_decent")), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), spec = structure(list(
cols = list(Name = structure(list(), class = c("collector_character",
"collector")), Adjusted_Name = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
Try the following - using the list of adjusted names, you can grep the list of desired words against column names and subset the data frame on it:
Data
df <- read.table(header = TRUE, text = "Name Reference Good Fair Bad Great Poor
George Hill 34 21 33 21 32
Frank Stairs 29 28 29 30 29
Bertha Trail 25 25 24 21 26")
adj_name <- c("good_run","very_great_run","bad run","fair run decent")
Index the columns based on grep from the string of desired names (note the tolower() on the column names as well):
# one regex alternation of every word appearing in the adjusted names
desired_words <- paste(unlist(strsplit(adj_name, "_| ")), collapse = "|")
# keep Name, Reference, and any column whose lower-cased name matches one of those words
df[, c(1:2, grep(desired_words, tolower(names(df))))]
Output
# Name Reference Good Fair Bad Great
#1 George Hill 34 21 33 21
#2 Frank Stairs 29 28 29 30
#3 Bertha Trail 25 25 24 21
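If you also need the renaming (not just the subsetting) without the tidyverse, a base R sketch along the same lines, assuming df and df2 as in the dput above, could look like this; match() scales fine to thousands of columns:
hits <- match(names(df), df2$Name)                        # NA where a column has no entry in DF2
keep <- names(df) %in% c("Name", "Reference") | !is.na(hits)
out  <- df[keep]                                          # drop unmatched value columns (e.g. Poor)
names(out) <- ifelse(is.na(hits[keep]), names(out), df2$Adjusted_Name[hits[keep]])
out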

Correlation of similar variables in R

I have slightly edited the data table.
I would like to correlate variables with similar names in my dataset:
   A_y  B_y  C_y  A_p  B_p  C_p
1   15   52   32   30   98   56
2   30   99   60   56   46   25
3   10   25   31   20   22   30
..........
n   55   23   85   12   34   52
I would like to obtain correlation of
A_y-A_p: 0.78
B_y-B_p: 0.88
C_y-C_p: 0.93
How can I do it in R? Is it possible?
This is really dangerous. Behavior of data.frames with invalid column names is undefined by the language definition. Duplicated column names are invalid.
You should restructure your input data. Anyway, here is an approach with your input data.
DF <- read.table(text = " A B C A B C
1 15 52 32 30 98 56
2 30 99 60 56 46 25
3 10 25 31 20 22 30", header = TRUE, check.names = FALSE)
# for each distinct column name, correlate the two like-named columns
sapply(unique(names(DF)), function(s) do.call(cor, unname(DF[, names(DF) == s])))
# A B C
#0.9995544 0.1585501 -0.6004010
#compare:
cor(c(15, 30, 10), c(30, 56, 20))
#[1] 0.9995544
Here is another base R option
within(
  rev(
    stack(
      Map(
        function(x) do.call(cor, unname(x)),
        # split the columns into pairs by their prefix (A, B, C)
        split.default(df, unique(gsub("_.*", "", names(df))))
      )
    )
  ),
  # replace the prefix labels with the full "A_y-A_p" style names
  ind <- sapply(
    ind,
    function(x) {
      paste0(grep(paste0("^", x), names(df), value = TRUE),
        collapse = "-"
      )
    }
  )
)
which gives
ind values
1 A_y-A_p 0.9995544
2 B_y-B_p 0.1585501
3 C_y-C_p -0.6004010
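For completeness, if the _y/_p suffix convention is guaranteed, the pairing can also be written directly (a sketch using the df from the Data block below):
y_cols <- grep("_y$", names(df), value = TRUE)
p_cols <- sub("_y$", "_p", y_cols)
# correlate each _y column with its _p partner, column by column
setNames(mapply(cor, df[y_cols], df[p_cols]), paste(y_cols, p_cols, sep = "-"))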
Data
df <- structure(list(A_y = c(15L, 30L, 10L), B_y = c(52L, 99L, 25L),
C_y = c(32L, 60L, 31L), A_p = c(30L, 56L, 20L), B_p = c(98L,
46L, 22L), C_p = c(56L, 25L, 30L)), class = "data.frame", row.names = c("1",
"2", "3"))

Column `project` must be a 1d atomic vector or a list

I have a data frame called df. I need to filter data from the data frame using filter(). Please check
my data frame:
Queue project._id project.ProjectName project.Status project.CreatedBy project.Createdtime X.gender
first 111 Travel 1 manchi 2017-04-24 18:50:27 male
last 111 2334 1 mono 2017-04-24 18:50:27 Female
first 111 556 1 gunal 2017-04-24 18:50:27 male
first 7888 classical 1 manchi 2017-04-24 18:50:27 Female
I'm trying to filter the data with dplyr using the code below.
Finalfilter<-df%>%
filter(project.ProjectName == "Travel",Queue=="first")%>%
select(X.gender.)
My expected result is:
Queue project._id project.ProjectName project.Status project.CreatedBy project.Createdtime X.gender
first 111 Travel 1 manchi 2017-04-24 18:50:27 male
first 111 556 1 gunal 2017-04-24 18:50:27 male
But I'm getting the error below. Help me to resolve this.
Error: Column `project` must be a 1d atomic vector or a list
dput:
structure(list(Queue = c("first", "last", "first", "first"),
project = structure(list(`_id` = c("111", "2334", "556",
"7888"), ProjectName = c("Travel", "HBussiness", "Travel",
"classical"), Status = c(1L, 1L, 1L, 1L), CreatedBy = c("manchi",
"mono", "gunal", "manchi"), Createdtime = structure(c(1493040027.826,
1493040027.826, 1493040027.826, 1493040027.826), class = c("POSIXct",
"POSIXt"))), .Names = c("_id", "ProjectName", "Status", "CreatedBy",
"Createdtime"), row.names = c(NA, 4L), class = "data.frame"),
X.gender. = c("male", "Female", "male", "Female")), .Names = c("Queue",
"project", "X.gender."), row.names = c(NA, 4L), class = "data.frame")
Your project column is itself a data frame, not an atomic vector (so project.ProjectName does not exist as a column), hence the error. A workaround can be:
# replace the nested data frame with just its ProjectName vector
df$project <- df$project$ProjectName
df %>%
  filter(project == "Travel" & Queue == "first") %>%
  select(X.gender.)
# X.gender.
#1 male
#2 male
If it doesn't work with dplyr, another option in base R (on the original nested df) is:
df[df$project$ProjectName == "Travel" & df$Queue == "first", ]
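If you would rather avoid the nested column altogether, one possible approach (a sketch based only on the dput above) is to flatten project into ordinary columns first and then filter as usual:
library(dplyr)
# unnest the project data frame into plain columns alongside Queue and X.gender.
flat <- cbind(df["Queue"], df$project, df["X.gender."])
flat %>%
  filter(ProjectName == "Travel", Queue == "first") %>%
  select(X.gender.)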

How to draw multiple lines in R under leaflet?

I am having trouble drawing multiple lines in R using leaflet. I have a base map of New York City stations. I would like to add more information from the existing data set. The data set has the columns start_lng, start_lat, end_lng, end_lat and total_trip. For each row, I would like to draw a line that connects the start point and the end point, so the two stations are connected, which stands for a trip: one trip (one line) per row. Also, for coloring, the darkness of the line segments should be based on total_trip. How would I be able to do that? Thanks.
leaflet(sample) %>%
  addTiles() %>%
  setView(-73.9, 40.7, zoom = 11) %>%
  addCircles(data = master_stations, lng = ~long, lat = ~lat, weight = 1, popup = ~name)
Here's part of my data set:
start.station.id start.station.longitude start.station.latitude end.station.longitude end.station.latitude total_trip
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 72 -73.99393 40.76727 -74.00859 40.73620 2
2 72 -73.99393 40.76727 -73.99074 40.73455 2
3 72 -73.99393 40.76727 -73.97722 40.76341 2
4 72 -73.99393 40.76727 -73.98192 40.76527 2
5 79 -74.00667 40.71912 -73.98163 40.75206 2
6 79 -74.00667 40.71912 -73.98658 40.75514 2
7 79 -74.00667 40.71912 -73.98317 40.75527 2
8 79 -74.00667 40.71912 -73.98722 40.75300 2
9 83 -73.97632 40.68383 -73.97493 40.68981 4
10 83 -73.97632 40.68383 -73.98657 40.70149 2
# ... with 899 more rows
This is the full data set:
structure(list(start.station.id = c(72, 72, 72, 72, 79, 79),
end.station.id = c(238, 285, 352, 468, 153, 465), total_trip = c(2L,
2L, 2L, 2L, 2L, 2L), start.station.name = c("\"W 52 St & 11 Ave\"",
"\"W 52 St & 11 Ave\"", "\"W 52 St & 11 Ave\"", "\"W 52 St & 11 Ave\"",
"\"Franklin St & W Broadway\"", "\"Franklin St & W Broadway\""
), start.station.longitude = c(-73.99392888, -73.99392888,
-73.99392888, -73.99392888, -74.00666661, -74.00666661),
start.station.latitude = c(40.76727216, 40.76727216, 40.76727216,
40.76727216, 40.71911552, 40.71911552), end.station.name = c("\"Bank St & Washington St\"",
"\"Broadway & E 14 St\"", "\"W 56 St & 6 Ave\"", "\"Broadway & W 55 St\"",
"\"E 40 St & 5 Ave\"", "\"Broadway & W 41 St\""), end.station.longitude = c(-74.00859207,
-73.99074142, -73.97722479, -73.98192338, -73.9816324043,
-73.98658032), end.station.latitude = c(40.7361967, 40.73454567,
40.76340613, 40.7652654, 40.752062307, 40.75513557)), .Names = c("start.station.id",
"end.station.id", "total_trip", "start.station.name", "start.station.longitude",
"start.station.latitude", "end.station.name", "end.station.longitude",
"end.station.latitude"), row.names = c(NA, -6L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), vars = list(start.station.id), drop = TRUE, indices = list(
0:3, 4:5), group_sizes = c(4L, 2L), biggest_group_size = 4L, labels = structure(list(
start.station.id = c(72, 79)), row.names = c(NA, -2L), class = "data.frame", vars = list(
start.station.id), drop = TRUE, .Names = "start.station.id"))
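One possible way to get one line per row, hedged as a sketch: here trips stands for the data in the dput above, the "Greys" palette is just an example, and the shading assumes total_trip actually varies (as it does in the full 899-row data set):
library(leaflet)

# shade lines from light to dark grey according to total_trip
pal <- colorNumeric("Greys", domain = trips$total_trip)

m <- leaflet() %>%
  addTiles() %>%
  setView(-73.9, 40.7, zoom = 11)

# add one polyline per row, connecting the start station to the end station
for (i in seq_len(nrow(trips))) {
  m <- addPolylines(
    m,
    lng    = c(trips$start.station.longitude[i], trips$end.station.longitude[i]),
    lat    = c(trips$start.station.latitude[i],  trips$end.station.latitude[i]),
    color  = pal(trips$total_trip[i]),
    weight = 2
  )
}
m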

Deriving contingency table from a bigger contingency table in R

I have a CSV-format contingency table made with Python, like this:
          case control
disease_A   20      30
disease_B   35      45
disease_C   42      52
disease_D   52      62
Now I want to derive 2x2 contingency tables from this contingency table to calculate chi-square values using R.
How can I derive a 2x2 table like the following from the contingency table above?
          case control
disease_A   20      30
disease_D   52      62
That's probably a novice question, but I'm new to R and I couldn't find the solution anywhere else.
Here's an approach.
The data:
txt <- " case control
disease_A 20 30
disease_B 35 45
disease_C 42 52
disease_D 52 62"
Read the data:
dat <- read.table(textConnection(txt))
# case control
# disease_A 20 30
# disease_B 35 45
# disease_C 42 52
# disease_D 52 62
Extract a subset of rows:
dat2 <- dat[rownames(dat) %in% c("disease_A", "disease_D"), ]
# case control
# disease_A 20 30
# disease_D 52 62
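The subset can then go straight into the chi-square test (shown as a sketch; as.matrix() just makes the input an explicit 2x2 contingency table):
chisq.test(as.matrix(dat2))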
If M is of class table
M <- structure(c(20, 35, 42, 52, 30, 45, 52, 62), .Dim = c(4L, 2L), .Dimnames = list(
c("disease_A", "disease_B", "disease_C", "disease_D"), c("case",
"control")), class = "table")
xtabs(Freq ~ Var1 + Var2, data = subset(as.data.frame(M, stringsAsFactors = FALSE),
  Var1 %in% c("disease_A", "disease_D")))
Var2
Var1 case control
disease_A 20 30
disease_D 52 62
If M is a data.frame
M <- structure(list(case = c(20L, 35L, 42L, 52L), control = c(30L,
45L, 52L, 62L)), .Names = c("case", "control"), class = "data.frame", row.names = c("disease_A",
"disease_B", "disease_C", "disease_D"))
as.table(as.matrix(M[grep("A|D", rownames(M)),]))
