I have previously asked for a method to split a string every 3 characters and save the results in a dataframe. Now I want to do the same thing, but using a sliding window of size n.
This question differs from the marked duplicate because the results here should be output in a dataframe. The mapply function given there would require quite some extra work to combine the results into a new dataframe and to add the positions as column names, as explained at the top of my previous question.
Example data
df <- data.frame(id = 1:2, seq = c('ABCDEF', 'XYZZZY'))
Looks like this:
id seq
1 1 ABCDEF
2 2 XYZZZY
Splitting into windows of three characters, sliding by n = 1, should give:
id 1 2 3 4
1 ABC BCD CDE DEF
2 XYZ YZZ ZZZ ZZY
I tried to do this using the separate function, as answered in my previous post; however, as far as I can tell, it can only split at fixed split points rather than over a sliding range.
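For context, here is a minimal sketch of one possible base-R approach (not from the original thread); the helper window_split and its arguments are made up for illustration:
# Take every window of `width` characters, sliding forward by `step`,
# and name each window by its starting position.
window_split <- function(s, width = 3, step = 1) {
  starts <- seq(1, nchar(s) - width + 1, by = step)
  setNames(substring(s, starts, starts + width - 1), starts)
}
df <- data.frame(id = 1:2, seq = c('ABCDEF', 'XYZZZY'), stringsAsFactors = FALSE)
windows <- t(sapply(df$seq, window_split))
rownames(windows) <- NULL
cbind(df["id"], windows)
#   id   1   2   3   4
# 1  1 ABC BCD CDE DEF
# 2  2 XYZ YZZ ZZZ ZZY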
Related
I would like to count, with the table() function in R, how many times a value occurs in a cell. However, some cells contain multiple values separated by colons. I give an example below:
example <- data.frame(c("A","B","A:::B"))
table(example)
the result is:
A A:::B B
1 1 1
but I want something like this:
A B
2 2
I tried duplicating the rows with this characteristic, but the dataset is already too large and duplicating rows makes it impossible to use. How can I do this?
Thanks
We can split the column values on ::: and get the table:
table(unlist(strsplit(example[[1]], "\\:+")))
# A B
# 2 2
I have got the following problem. I have a data.frame with an x and y column representing some points in space:
X<-c(18.25743,18.25783,18.25823,18.25850,18.25863,18.25878,
18.25885,18.25912,18.25943,18.25962,18.25978,18.26000,
18.26022,18.26051,18.26070,18.26095,18.26118,18.26140,
18.26189,18.26250,18.26310,18.26390)
Y<-c(44.69561,44.69564,44.69567,44.69567,44.69586,
44.69600,44.69637,44.69671,44.69691,44.69701,44.69720,
44.69740,44.69763,44.69774,44.69787,44.69790,44.69791,
44.69795,44.69812,44.69802,44.69812,44.69834)
eDF<-data.frame(X,Y)
Now my problem is that they are "sorted" wrong for plotting. So what I need is a function that writes together the rows of the two points which belong together (in a list of lists):
1 and 12 is ID1
2 and 13 is ID2
3 and 14 is ID3
...
11 and 22 is ID11
Every list created this way within the list of lists should have its own unique ID (just numbering from 1 to the end). I need this because I have the same problem in all my data, which have different lengths.
It would be great if the starting point of the second consecutive row selection (the 12) were flexible, always taking the first row after half of the data, i.e. (rownumber/2) + 1, which is 12 in this example.
I have tried some things and I think I'm on the right track, but I can't figure out a solution by myself.
This function is pretty close, but I can't manage to make it start at different rows (1 and 12):
lapply(2:nrow(eDF), function(x) eDF[(x-1):x,])
I also tried to figure it out with seq, and it would do what I need if I could make a list of lists by connecting both code samples. I also need to change the hard-coded start and end numbers into a dynamic solution.
eDF[(seq(1,to=11,by=1)),] # selecting rows 1 to 11
eDF[(seq(12,to=nrow(eDF),by=1)),] #selecting rows 12 to end
Does anyone have any ideas?
I don't know if you need an ID column inside the new list, but another way would be:
#create the IDs
eDF$ID <- rep(1:11,2)
#split the data.frame according to those
mylist <- split(eDF, eDF$ID)
Output:
mylist
$`1`
X Y ID
1 18.25743 44.69561 1
12 18.26000 44.69740 1
$`2`
X Y ID
2 18.25783 44.69564 2
13 18.26022 44.69763 2
$`3`
X Y ID
3 18.25823 44.69567 3
14 18.26051 44.69774 3
$`4`
X Y ID
4 18.2585 44.69567 4
15 18.2607 44.69787 4
#and so on...
You could just do split(eDF, rep(1:11, 2)) if you don't need the ID column.
We can modify the OP's lapply code
lapply(1:11, function(i) eDF[c(i, i+11),])
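To make the halfway point dynamic rather than hard-coding 11 (as the question asks), something along these lines should work; this is a sketch that assumes the data always has an even number of rows:
# Derive the halfway point from the row count, then pair row i with row i + half
half <- nrow(eDF) / 2
pairs <- lapply(seq_len(half), function(i) eDF[c(i, i + half), ])
names(pairs) <- seq_len(half)   # IDs running from 1 to half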
I have a smallish (2k) data set that contains questionnaire answers filled out by students who were sampled twice a year. Not all the students that were present for the first wave were there for the second wave, and vice versa. For each student, a unique ID was created that consists of the school code, the class code, the student number, and the wave as a decimal. For example, 100612.1 is a student from school 10, grade 6, number 12 on the names list, in the first wave. The idea behind the decimal point was to have a way of identifying the same student again in the data set (the only value which differs by less than abs(1) from a given ID is the same student in the other wave). At least that was the idea.
I was thinking of a script that would do the following:
- find the rows whose unique IDs are less than abs(1) apart from one another
- for those rows, generate a new row (in a new table) that consists of the student ID and the delta of the measured variables (i.e. the value in wave 2 minus the value in wave 1).
I am new to R, but I have a tiny bit of background in other OOP languages. I thought about creating a for loop that runs from 1 to length(df) and just looks for each row's "sibling". My gut feeling tells me that this is not the way things are done in R. Any ideas?
All I need is a quick way of sifting through the data looking for the second-wave row. I think the rest should be straightforward from there.
Thank you for helping.
PS: since this is my first post here, I apologize beforehand for any wrongdoings in this post... :)
The question alludes to data.table, so here is a way to adapt @Jed's answer using that package.
ids <- c(100612.1,100612.2,100613.1,100613.2,110714.1,201802.2)
answers <- c(5,4,3,4,1,0)
Example data as before; now, instead of data.frame and tapply, you can do this:
library(data.table)
surveyDT <- data.table(ids, answers)
surveyDT[, `:=` (child = substr(ids, 1, 6), wave = substr(ids, 8, 8))] # split ID's
# note multiple assign-by-reference := syntax above
setkey(surveyDT, child, wave) # order data
# calculate delta on keyed data, grouping by child
surveyDT[, delta := diff(answers), by = child]
unique(surveyDT[, delta, by = child]) # list results
child delta
1: 100612 -1
2: 100613 1
3: 110714 NA
4: 201802 NA
To remove rows with NA values for delta:
unique(surveyDT[, .SD[(!is.na(delta))], by = child])
child ids answers wave delta
1: 100612 100612.1 5 1 -1
2: 100613 100613.1 3 1 1
Use .SDcols to output only specific columns (in addition to the by columns), for example,
unique(surveyDT[, .SD[(!is.na(delta))], by = child, .SDcols = 'delta'])
child delta
1: 100612 -1
2: 100613 1
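As a hedged alternative (assuming a single delta per child), a simpler way to get the same per-child table is to filter in i and select the columns in j:
# drop the NA deltas, keep only child and delta, then deduplicate
unique(surveyDT[!is.na(delta), .(child, delta)])
#     child delta
# 1: 100612    -1
# 2: 100613     1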
It took me some time to get acquainted with data.table syntax, but now I find it more intuitive, and it's fast for big data.
There are two ways that come to mind. The easiest is to use the function floor(), which returns the integer part of a number. For example:
floor(100612.1)
#[1] 100612
floor(9.9)
#[1] 9
Alternatively, you could write a fairly simple regular expression to strip off the decimal part, too. Then you can use unique() to find the rows that are or are not duplicated entries.
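A hedged sketch of that regex route (the ids vector is the same fake data as below):
ids <- c(100612.1, 100612.2, 100613.1, 100613.2, 110714.1, 201802.2)
child <- sub("\\.\\d+$", "", as.character(ids))   # drop the ".wave" suffix
child
# "100612" "100612" "100613" "100613" "110714" "201802"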
Let's make some fake data so we can see the problem easily:
ids <- c(100612.1,100612.2,100613.1,100613.2,110714.1,201802.2)
answers <- c(5,4,3,4,1,0)
survey <- data.frame(ids,answers)
Now let's split our IDs into two different columns:
survey$child_id <- substr(survey$ids,1,6)
survey$wave_id <- substr(survey$ids,8,8)
Then we'll order by child and wave, and compute differences based on child:
survey <- survey[order(survey$child_id, survey$wave_id), ]
survey$delta <- unlist(tapply(survey$answers, survey$child_id, function(x) c(NA,diff(x))))
Output:
ids answers child_id wave_id delta
1 100612.1 5 100612 1 NA
2 100612.2 4 100612 2 -1
3 100613.1 3 100613 1 NA
4 100613.2 4 100613 2 1
5 110714.1 1 110714 1 NA
6 201802.2 0 201802 2 NA
I have a list of data frames (let's call it "data") that I have generated, which goes something like this:
$"something.csv"
x y z
1 1 1 1
2 2 2 2
3 3 3 3
$"something else.csv"
x y z
1 1 1 1
2 2 2 2
3 3 3 3
I would like to output one number from column x of the table "something.csv".
So far I have used:
data$"something.csv"$x[2]
This code works and I am happy that it does, but my problem is that I want to automate this process, so I have put all the table titles into a list (filename) which goes:
[1] "something.csv", "something else.csv"
So I made a for loop which should allow me to do so, but when I put in:
data$filename[1]$x[2]
it gives me back NULL.
When I print filename[1], I get [1] "something.csv", and if I type
data$"something.csv"$x[2]
I get the result I want. So if filename[1] = "something.csv", why does it not give me the same result?
I just want my code to output the second row of column x and to automate this by using filename[i] in a for loop.
The way you have approached the problem, R tries to find an element literally named 'filename[1]' in the list, and no such element exists. Hence, NULL gets returned.
You need to use square brackets, and subset the object data. Here's an example:
# Generate data
data<-vector("list", 2)
names(data)<-c("something.csv", "something else.csv")
data[[1]]<-data.frame(x=1:3, y=1:3, z=1:3)
data[[2]]<-data.frame(x=1:3, y=1:3, z=1:3)
filename<-names(data)
# Subset the data
# The first data frame, notice the square brackets for subsetting lists!
data[[filename[1]]]
# column x
data[[filename[1]]]$x
# Second observation of x
data[[filename[1]]]$x[2]
The above uses the names of the objects in the list for subsetting. You can also use the number-based subsetting suggested by @Jeremy.
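To automate the lookup with the for loop mentioned in the question, a minimal sketch along these lines should work (reusing the data and filename objects defined above):
for (i in seq_along(filename)) {
  # print the second observation of column x for each table in the list
  cat(filename[i], ":", data[[filename[i]]]$x[2], "\n")
}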
You can also use [ and [[ to get data$"something.csv"$x[2]; try
data[[1]][2,1]
where [[1]] selects the first list element and [2,1] selects row 2, column 1 of that data frame.
I have done a lot of googling, but I didn't find a satisfactory solution to my problem.
Say we have data file as:
Tag v1 v2 v3
A 1 2 3
B 1 2 2
C 5 6 1
A 9 2 7
C 1 0 1
The first line is the header. The first column is the group ID (the data has 3 groups: A, B, C), while the other columns are values.
I want to read this file in R so that I can apply different functions on the data.
For example, I tried to read the file and to get the column means:
dt<-read.table(file_name,head=T) #gives warnings
apply(dt,2,mean) #gives NA NA NA
I want to read this file and get the column means. Then I want to separate the data into 3 groups (according to Tags A, B, C) and calculate the mean (column-wise) for each group. Any help?
apply(dt,2,mean) doesn't work because apply coerces the first argument to an array via as.matrix (as is stated in the first paragraph of the Details section of ?apply). Since the first column is character, all elements in the coerced matrix object will be character.
Try this instead:
sapply(dt,mean) # works because data.frames are lists
To calculate column means by groups:
# using base functions
grpMeans1 <- t(sapply(split(dt[,c("v1","v2","v3")], dt[,"Tag"]), colMeans))
# using plyr
library(plyr)
grpMeans2 <- ddply(dt, "Tag", function(x) colMeans(x[,c("v1","v2","v3")]))
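If you prefer formula syntax, another base R option (a hedged alternative, not part of the original answer) is aggregate():
# group by Tag and take column means of v1, v2, v3 in one call
grpMeans3 <- aggregate(cbind(v1, v2, v3) ~ Tag, data = dt, FUN = mean)
grpMeans3
#   Tag v1 v2 v3
# 1   A  5  2  5
# 2   B  1  2  2
# 3   C  3  3  1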