Tutorials

  • Point Cloud Coding: the Status Quo
  • Video Summarization and Re-use Technologies and Tools
  • Immersive Imaging Technologies: from Capture to Display
  • Device Fingerprinting and its Applications in Multimedia Forensics and Security
  • Deep Bayesian Modeling and Learning

    Point Cloud Coding: the Status Quo

    Contacts:

    • João Ascenso, Instituto Superior Técnico, Lisbon, Portugal
    • Fernando Pereira, Instituto Superior Técnico, Lisbon, Portugal

    Abstract:

    Recently, 3D visual representation models such as light fields and point clouds have become popular due to their capability to represent the real world in a more complete, realistic and immersive way, paving the way for new and more advanced visual experiences. The point cloud (PC) representation model efficiently represents the surface of objects/scenes by means of a set of 3D points and associated attributes, and it is increasingly used in applications ranging from autonomous cars to augmented reality. Emerging imaging sensors have made it easier to perform richer and denser PC acquisitions, notably with millions of points, making it impractical to store and transmit such very large amounts of data without compression. This bottleneck has raised the need for efficient PC coding solutions that can offer immersive visual experiences and good quality of experience.
    This tutorial will survey the most relevant PC basics as well as the main PC coding solutions available today. Regarding the content of this tutorial, it is important to highlight: 1) a new classification taxonomy for PC coding solutions to more easily identify and abstract their differences, commonalities and relationships; 2) representative static and dynamic PC coding solutions available in the literature, such as octree-, transform- and graph-based PC coding, among others; 3) the recently developed MPEG PC coding standards, notably Video-based Point Cloud Coding (V-PCC), for dynamic content, and Geometry-based Point Cloud Coding (G-PCC), for static and dynamically acquired content; 4) rate-distortion (RD) performance evaluation of the G-PCC and V-PCC standards and other relevant PC coding solutions, using suitable objective quality metrics. The tutorial will end with a discussion of the strengths and weaknesses of the current PC coding solutions, as well as of future trends and directions.

    Speaker bios:

    João Ascenso is a professor at the Department of Electrical and Computer Engineering of Instituto Superior Técnico and is with the Multimedia Signal Processing Group of Instituto de Telecomunicações, Lisbon, Portugal. João Ascenso received the E.E., M.Sc. and Ph.D. degrees in Electrical and Computer Engineering from Instituto Superior Técnico in 1999, 2003 and 2010, respectively. In the past, he was an adjunct professor at Instituto Superior de Engenharia de Lisboa and Instituto Politécnico de Setúbal. He coordinates several national and international research projects in the areas of coding, analysis and description of video. His most recent project grants were in the field of point cloud coding and quality assessment. He is also very active in the ISO/IEC MPEG and JPEG standardization activities and currently chairs the JPEG-AI ad hoc group, which targets the evaluation and development of learning-based image compression solutions. He has published more than 100 papers in international conferences and journals and has more than 3200 citations over 35 papers (h-index of 25). He is an associate editor of IEEE Transactions on Multimedia and IEEE Transactions on Image Processing, and was an associate editor of the IEEE Signal Processing Letters. He is an elected member of the IEEE Multimedia Signal Processing Technical Committee. He has served on the organizing committees of well-known IEEE international conferences, such as MMSP 2020, ICME 2020, ISM 2018 and QoMEX 2016, among others. He has also served as a technical program committee member and area chair for several well-known conferences in the multimedia signal processing field, such as ICIP, MMSP and ICME, and has given invited talks and tutorials at conferences and workshops. He has received two Best Paper Awards, at the 31st Picture Coding Symposium 2015, Cairns, Australia, and at the IEEE International Conference on Multimedia and Expo 2019.
    He has also won the ‘Excellent Professor’ award from the Electrical and Computer Engineering Department of Instituto Superior Técnico several times. His current research interests include visual coding, quality assessment, light field, point cloud and holography processing, indexing and searching of audio-visual content, and visual sensor networks.

    Fernando Pereira is currently with the Department of Electrical and Computer Engineering of Instituto Superior Técnico and with Instituto de Telecomunicações, Lisbon, Portugal. He is responsible for the participation of IST in many national and international research projects. He often acts as a project evaluator and auditor for various organizations. He is Area Editor of the Signal Processing: Image Communication Journal and Associate Editor of the EURASIP Journal on Image and Video Processing, and is or has been a member of the Editorial Board of the Signal Processing Magazine and an Associate Editor of IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Image Processing, IEEE Transactions on Multimedia, and IEEE Signal Processing Magazine. In 2013-2015, he was the Editor-in-Chief of the IEEE Journal of Selected Topics in Signal Processing. He is or has been a member of the IEEE Signal Processing Society Technical Committees on Image, Video and Multidimensional Signal Processing, and Multimedia Signal Processing, and of the IEEE Circuits and Systems Society Technical Committees on Visual Signal Processing and Communications, and Multimedia Systems and Applications. He was an IEEE Distinguished Lecturer in 2005 and was elected an IEEE Fellow in 2008 for “contributions to object-based digital video representation technologies and standards”. He was elected to serve on the Signal Processing Society Board of Governors as Member-at-Large for the 2012 term and the 2014-2016 term. Since January 2018, he has been the SPS Vice-President for Conferences. Since 2013, he has also been a EURASIP Fellow for “contributions to digital video representation technologies and standards”. He was elected to serve on the European Signal Processing Society Board of Directors for the 2015-2018 term. Since 2015, he has also been an IET Fellow. He is or has been a member of the Scientific and Program Committees of many international conferences and workshops.
    He was the General Chair of the Picture Coding Symposium (PCS) in 2007, the Technical Program Co-Chair of the International Conference on Image Processing (ICIP) in 2010 and 2016, the Technical Program Chair of the International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS) in 2008 and 2012, and the General Chair of the International Conference on Quality of Multimedia Experience (QoMEX) in 2016. He has been participating in the MPEG standardization activities, notably as head of the Portuguese delegation, chairman of the MPEG Requirements Group, and chairman of many Ad Hoc Groups related to the MPEG-4 and MPEG-7 standards. Since February 2016, he has been the JPEG Requirements Chair. He has been one of the key designers of the JPEG Pleno project, which targets defining standard representations for several types of plenoptic imaging, notably light fields, point clouds and holograms. He has been conducting research on point cloud clustering, coding and quality assessment, and publishing in these areas. He has contributed more than 250 papers to international journals, conferences and workshops, and has given several tens of invited talks at conferences and workshops. His areas of interest are video analysis, coding, description and adaptation, and advanced multimedia services.



    Video Summarization and Re-use Technologies and Tools

    Contacts:

    • Vasileios Mezaris, CERTH-ITI, Greece
    • Lyndon Nixon, MODUL Technology, Austria

    Abstract:

    This tutorial will deliver a broad overview of the main technologies that enable the automatic generation of video summaries for re-use in different distribution channels, and the optimisation of audience reach and engagement through video summaries; it will also provide an in-depth analysis of selected state-of-the-art (SoA) methods and tools on these topics. It will comprise two main modules. The first module, on video summaries generation, will provide an overview of deep-learning-based video summarization techniques, and will then discuss in depth a few selected SoA techniques that are based on Generative Adversarial Networks. Special emphasis will be placed on unsupervised learning techniques, whose advantages will also be elaborated on. An overview of video summarization datasets, evaluation protocols and related considerations & limitations will also be presented. The second module, on video summaries (re-)use and recommendation, will discuss the use of Web and social media analysis to detect topics in online content and trends in online discussion. It will subsequently examine the application of predictive analytics to suggest future trending topics, in order to guide video summaries publication strategies. Besides the underlying technologies, a few complete tools will be demonstrated, to link the research aspects of video summarization, trend detection and predictive analytics with practitioners’ expectations and needs for video summarization and (re-)publication online. The tutorial’s target audience includes researchers working on video summarization and deep learning and, more generally, on deep-learning-based multimedia understanding; researchers in web and social media data analysis, topic and trend detection, and predictive analytics; and practitioners in video content creation and (re-)use, including YouTube/Instagram prosumers, TV and film producers, and representatives of broadcasters and online media platforms.

    Schedule:

    • Introduction and motivation (15 minutes)
    • Video summaries generation: deep-learning-based methods, datasets and evaluation protocols (75 minutes)
    • Break
    • Topic-driven summarization: tools for trending topics detection, prediction and video summarization (75 minutes)
    • Concluding remarks and discussion (15 minutes)

    Total tutorial time (excluding the break): 3 hours

    Speaker bios:

    Vasileios Mezaris is a Research Director (Senior Researcher Grade A) with the Information Technologies Institute / Centre for Research and Technology Hellas, Thessaloniki, Greece. His research interests include multimedia understanding and artificial intelligence; in particular, image and video analysis and annotation, machine learning and deep learning for multimedia understanding and big data analytics, multimedia indexing and retrieval, and applications of multimedia understanding and artificial intelligence in specific domains (including TV broadcasting and news, education and culture, medical / ecological / business data analysis). Dr. Mezaris has co-authored more than 40 papers in refereed journals, 20 book chapters, 150 papers in international conferences, and 3 patents. He has edited two books and several proceedings volumes; he serves as Associate Editor for the IEEE Signal Processing Letters (2016-present) and the IEEE Transactions on Multimedia (2012-2015, and 2018-present); and he serves regularly as a guest editor for international journals, as an organizer or reviewer for conferences/workshops, and as a reviewer of research projects and project proposals for national and international funding agencies. He has participated in many research projects, including as Coordinator of the EC H2020 projects InVID and MOVING. He is a Senior Member of the IEEE.

    Lyndon J B Nixon is the CTO of MODUL Technology GmbH. He also holds the position of Assistant Professor in the New Media Technology group at MODUL University. He has been researching in the semantic multimedia domain since 2001. His PhD (2007) was on the automatic generation of multimedia presentations using semantics. He has been active in many European and Austrian projects, including in the roles of Scientific Coordinator (LinkedTV) and Project Coordinator (ReTV, SOFI, MediaMixer, SmartReality, ConnectME). He is a proponent of “Linked Media” (ensuring rich semantic annotations of multimedia assets so that systems can derive associations between them for search, browsing, navigation or recommendation) and has co-organized a series of Linked Media workshops (WWW2013, ESWC2014, WWW2015, ESWC2016). These are among over 40 events he has co-chaired; his record also includes 27 talks, 8 book chapters, 6 journal articles and 88 refereed publications. He currently focuses his research on content analysis of image and video in social networks, semantic annotation and linking of media fragments, and combining annotations and data analytics in prediction and recommendation for TV programming.



    Immersive Imaging Technologies: from Capture to Display

    Contacts:

    • Dr. Martin Alain, Trinity College Dublin, Ireland
    • Dr. Cagri Ozcinar, Trinity College Dublin, Ireland
    • Dr. Emin Zerman, Trinity College Dublin, Ireland

    Abstract:

    Advances in imaging technologies over the last decade have brought a number of alternatives to the way we acquire and display visual information. These new imaging technologies are immersive because they provide the viewer with more information, which either surrounds the viewer or helps the viewer become immersed in this augmented representation. These immersive imaging technologies include light fields, omnidirectional images and videos, and volumetric (also known as free-viewpoint) videos. These different modalities cover the full spectrum of immersive imaging, from 3 degrees of freedom (3DoF) to 6 degrees of freedom (6DoF), and can be used for virtual reality (VR) as well as augmented reality (AR). Applications of immersive imaging notably include education, cultural heritage, tele-immersion, remote collaboration, and communication. In this tutorial, we cover all stages of immersive imaging, from content capture to display. The main concepts of immersive imaging will first be introduced, and creative experiments based on immersive imaging will be presented as a specific illustration of these technologies. Next, content acquisition based on single- or multi-camera systems is presented, along with the corresponding data formats. Content coding is then discussed, notably ongoing standardisation efforts, followed by adaptive streaming strategies. Immersive imaging displays are then presented, as they play a crucial role in the user’s sense of immersion. Image rendering algorithms related to such displays are also explained. Finally, perception and quality evaluation of immersive imaging are presented.

    Keywords: immersive imaging, emerging media, light fields, omnidirectional videos, volumetric videos, 3DoF, 6DoF

    Schedule:

    The tutorial will follow the outline presented below:

    • Part I: Immersive Imaging Technologies (~20 minutes)
      • Immersion & Tele-Immersion
      • Different Imaging Modalities
      • Creative Experiments
    • Part II: Acquisition and Data Format (~40 minutes)
      • Single-camera systems
      • Multi-camera systems
    • Part III: Content Delivery (~40 minutes)
      • Coding
      • Adaptive Streaming
    • Part IV: Rendering and Display Technologies (~40 minutes)
      • Immersive imaging on 2D screens
      • HMDs for VR
      • HMDs for AR
    • Part V: Perception & Quality Evaluation (~40 minutes)
      • Visual Perception
      • Visual Attention
      • Quality Assessment

    URL to Tutorial Website:

    https://v-sense.scss.tcd.ie/lectures/tutorial-on-immersive-imaging-technologies/

    Speaker bios:

    Dr. Martin Alain received the Master’s degree in electrical engineering from the Bordeaux Graduate School of Engineering (ENSEIRB-MATMECA), Bordeaux, France in 2012 and the PhD degree in signal processing and telecommunications from University of Rennes 1, Rennes, France in 2016. As a PhD student working at Technicolor and INRIA in Rennes, France, he explored novel image and video compression algorithms.
    Since September 2016, he has been a postdoctoral researcher in the V-SENSE project at the School of Computer Science and Statistics in Trinity College Dublin, Ireland. His research interests lie at the intersection of signal and image processing, computer vision, and computer graphics. His current work focuses on light field imaging, in particular denoising, super-resolution, compression, scene reconstruction, and rendering.
    Martin is a reviewer for the Irish Machine Vision and Image Processing conference, IEEE International Conference on Image Processing, IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems I, and IEEE Transactions on Circuits and Systems for Video Technology. He was a special session chair at EUSIPCO 2018 in Rome, ICIP 2019 in Taipei, and ICME 2020 in London.

    Dr. Cagri Ozcinar has been a research fellow within the V-SENSE project at Trinity College Dublin, Ireland, since July 2016. Before joining the V-SENSE team, he was a post-doctoral fellow in the Multimedia group at Institut Mines-Telecom Telecom ParisTech, Paris, France.
    Cagri received the M.Sc. (Hons.) and the Ph.D. degrees in electronic engineering from the University of Surrey, UK, in 2010 and 2015, respectively. His current research interests include visual attention (saliency), coding, streaming, and computer vision for immersive audio-visual technologies.
    Cagri has served as a reviewer for a number of journals and conference proceedings, such as IEEE TIP, IEEE TCSVT, IEEE TMM, IEEE Journal of STSP, CVPR, IEEE ICASSP, IEEE ICIP, IEEE QoMEX, IEEE MMSP, EUSIPCO, and BMVC. He has been involved in organizing workshops, challenges, and special sessions. He was a special session chair on recent advances in immersive imaging technologies at EUSIPCO 2018, ICIP 2019, and ICME 2020.

    Dr. Emin Zerman has been a postdoctoral research fellow in the V-SENSE project at the School of Computer Science and Statistics, Trinity College Dublin, Ireland, since February 2018. He received his Ph.D. degree (2018) in Signals and Images from Télécom ParisTech, France, and his M.Sc. degree (2013) and B.Sc. degree (2011) in Electrical and Electronics Engineering from the Middle East Technical University, Turkey. His research interests include image and video processing, immersive multimedia applications, human visual perception, high dynamic range imaging, and multimedia quality assessment. Emin is a member of the Institute of Electrical and Electronics Engineers (IEEE) and the IEEE Signal Processing Society. He has been acting as a reviewer for several conferences and peer-reviewed journals, including Signal Processing: Image Communication, IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), IEEE Transactions on Image Processing (TIP), MDPI Journal of Imaging, MDPI Applied Sciences, IEEE International Workshop on Multimedia Signal Processing (MMSP), European Signal Processing Conference (EUSIPCO), and IEEE International Conference on Image Processing (ICIP). He is one of the special session organisers at ICME 2020 in London.



    Device Fingerprinting and its Applications in Multimedia Forensics and Security

    Contacts:

    • Chang-Tsun Li, Deakin University, Australia

    Abstract:

    Similar to the identification of people through fingerprint analysis, multimedia forensics and security assurance through device fingerprint analysis have attracted much attention amongst scientists, practitioners and law enforcement agencies around the world in the past decade. Device information, such as device models and serial numbers, stored in the EXIF metadata is useful for identifying the devices responsible for the creation of the images and videos in question. However, because it is stored separately from the content, EXIF metadata can easily be removed or manipulated. Device fingerprints deposited in the content by the devices themselves provide a more reliable alternative to aid forensic investigations and multimedia assurance. Various hardware and software components of imaging devices leave model- or device-specific artifacts in the content during the digital image acquisition process. These model- or device-specific artifacts, if properly extracted, can be used as device fingerprints to identify the source devices. This tutorial will start with an introduction to various types of device fingerprints. The presentation will then focus on sensor pattern noise, which is currently the only form of device fingerprint that can differentiate individual devices of the same model. We will also discuss the real-world applications of sensor pattern noise to source device verification, common source inference, source device identification, content authentication (including fake news detection) and source-oriented image clustering. Some real-world use cases in the law enforcement community will also be presented. Finally, we will discuss the limitations of existing device fingerprints and point out a few lines for future investigation, including the use of deep learning to infer device fingerprints.

    Schedule:

    • Device Fingerprints (50 minutes)
      • Lens aberrations
      • Colour filter array and colour interpolation artefacts
      • Camera response function
      • Quantisation table of JPEG compression
      • Sensor pattern noise
    • Sensor Pattern Noise Extraction and Enhancement (60 minutes)
      • Sensor pattern noise extraction
      • Sensor pattern noise enhancement
    • Break (20 minutes)
    • SPN in Multimedia Forensic Applications (50 minutes)
      • Source device verification
      • Common source inference
      • Source device identification
      • Content authentication (including fake news detection)
      • Source-oriented image clustering
    • Conclusions and Future Work (20 minutes)
      • Conclusions
      • Issues surrounding existing device fingerprints
      • Future work (including the use of deep learning)

    Speaker bio:

    Chang-Tsun Li received the BSc degree in electrical engineering from National Defence University (NDU), Taiwan, in 1987, the MSc degree in computer science from the U.S. Naval Postgraduate School, USA, in 1992, and the PhD degree in computer science from the University of Warwick, UK, in 1998. He was an associate professor in the Department of Electrical Engineering at NDU during 1998-2002 and a visiting professor in the Department of Computer Science at the U.S. Naval Postgraduate School in the second half of 2001. He was a professor in the Department of Computer Science at the University of Warwick, UK, until December 2016. He was a professor in the School of Computing and Mathematics, and Director of the Data Science Research Unit, at Charles Sturt University, Australia, from January 2017 to February 2019. He is currently Professor of Cyber Security in the School of Information Technology at Deakin University, Australia, and Research Director of Deakin’s Centre for Cyber Security Research and Innovation. His research interests include multimedia forensics and security, biometrics, data mining, machine learning, data analytics, computer vision, image processing, pattern recognition, bioinformatics, and content-based image retrieval. The outcomes of his multimedia forensics research have been translated into award-winning commercial products protected by a series of international patents and have been used by a number of law enforcement agencies, national security institutions, courts of law, banks and companies around the world. He is currently an Associate Editor of IEEE Access, the EURASIP Journal on Image and Video Processing (JIVP) and IET Biometrics. He has published over 200 papers in prestigious international journals and conference proceedings, including the winner of the 2018 IEEE AVSS Best Paper Award.
    He has contributed actively to the organisation of many international conferences and workshops and has also served as a member of the international program committees of numerous international conferences. He also actively disseminates his research outcomes through keynote speeches, tutorials and talks at various international events.



    Deep Bayesian Modeling and Learning

    Contacts:

    • Jen-Tzung Chien, National Chiao Tung University, Taiwan

    Abstract:

    This tutorial addresses the advances in deep Bayesian learning for spatial and temporal data, which are ubiquitous in speech, music, text, image, video, web, communication and networking applications. Multimedia content is analyzed and represented to fulfill a variety of tasks, including classification, synthesis, generation, segmentation, dialogue, search, recommendation, summarization, question answering, captioning, mining, translation and adaptation, to name a few. Traditionally, “deep learning” is taken to be a learning process where the inference or optimization is based on a real-valued deterministic model. The “latent semantic structure” in words, sentences, images, actions, documents or videos learned from data may not be well expressed or correctly optimized in mathematical logic or computer programs. The “distribution function” in discrete or continuous latent variable models for spatial and temporal sequences may not be properly decomposed or estimated. This tutorial addresses the fundamentals of statistical models and neural networks, and focuses on a series of advanced Bayesian models and deep models, including Bayesian nonparametrics, recurrent neural networks, sequence-to-sequence models, variational auto-encoders (VAEs), generative adversarial networks, attention mechanisms, memory-augmented neural networks, skip neural networks, temporal difference VAEs, stochastic neural networks, stochastic temporal convolutional networks, predictive state neural networks, and policy neural networks. Enhancing the prior/posterior representation is also addressed. We present how these models are connected and why they work for a variety of applications on symbolic and complex patterns in sequence data. Variational inference and sampling methods are formulated to tackle the optimization of complicated models. The embeddings, clustering or co-clustering of words, sentences or objects are merged with linguistic and semantic constraints.
    A series of case studies is presented to tackle different issues in deep Bayesian modeling and learning. Finally, we will point out a number of directions and outlooks for future studies.

    Schedule:

    • Introduction and motivation (30 minutes)
    • Bayesian learning (30 minutes)
    • Deep spatial and temporal modeling (60 minutes)
    • Deep Bayesian learning (60 minutes)

    Total tutorial time (excluding the break): 3 hours

    Speaker bio:

    Jen-Tzung Chien is the Chair Professor at the National Chiao Tung University, Taiwan. He held a Visiting Professor position at the IBM T. J. Watson Research Center, Yorktown Heights, NY, in 2010. His research interests include machine learning, deep learning, computer vision and natural language processing. Dr. Chien served as an associate editor of the IEEE Signal Processing Letters in 2008-2011, the general co-chair of the IEEE International Workshop on Machine Learning for Signal Processing in 2017, and a tutorial speaker at ICASSP in 2012, 2015 and 2017, INTERSPEECH in 2013 and 2016, COLING in 2018, and AAAI, ACL, KDD and IJCAI in 2019. He received the Best Paper Award of the IEEE Automatic Speech Recognition and Understanding Workshop in 2011 and the AAPM Farrington Daniels Award in 2018. He has published extensively, including the books “Bayesian Speech and Language Processing”, Cambridge University Press, 2015, and “Source Separation and Machine Learning”, Academic Press, 2018. He is currently serving as an elected member of the IEEE Machine Learning for Signal Processing Technical Committee.